EMPIRICAL AND SIMULATION STUDIES OF

FLEXILEVEL ABILITY TESTING

One result of the growing sophistication and availability of time-shared
computer facilities has been increased interest in new modes of testing
and instruction. In the area of ability measurement, much research has been
directed at investigating various strategies of tailored (Lord, 1970) or
adaptive (Weiss & Betz, 1973) ability testing. The general aim of adaptive
testing procedures is to "adapt" or "tailor" the difficulty level of the
items presented to the ability level of an individual as estimated from
item response patterns. Consequently, as testing proceeds, the items
administered will be increasingly appropriate for the accurate measurement
of that individual's ability.

Adaptive testing strategies are differentiated by the set of rules used
to select items during the testing procedure (Weiss, 1974). The most
extensively researched adaptive strategy is the pyramidal or "tree-structure"
model. This approach uses a branching (or item selection) rule in which,
following a correct response to an item, the examinee receives a slightly
more difficult item, and following an incorrect response, the examinee
receives a slightly less difficult item. Research to date, summarized by
Weiss & Betz (1973) and Larkin & Weiss (1974), has shown that pyramidal
strategies can yield equal or better reliability and validity than
conventionally-administered tests while requiring substantially fewer items
to be administered. The flexilevel test (Lord, 1971b) is a modification of
the pyramidal strategy which would permit paper and pencil administration
and which would require a smaller initial item pool than is required by
pyramidal strategies.

Figure 1 illustrates the item structure for a flexilevel test. As
Figure 1 shows, the flexilevel test consists of one item at each of a number
of equally-spaced difficulty levels. Item 1 in Figure 1 is an item of
approximately median difficulty (p=.50). The even-numbered items decrease
in difficulty with increasing distance from the median difficulty level
while the odd-numbered items increase in difficulty.

Figure 1

ITEM STRUCTURE FOR A TEN-STAGE FLEXILEVEL TEST

[Figure not reproduced. Horizontal axis: difficulty/ability, running from
easy to difficult.]

In the flexilevel test illustrated, ten items would be administered to
each individual. The total item structure requires 19 items or, in general,
2N-1 items, where N is the number of items to be administered to each
individual. The first item administered to all individuals is the median
difficulty item (item 1) for the group taking the test. Following
administration of the first item, a differential branching rule determines
item selection: following a correct response to an item, the examinee
receives the next more difficult item previously unanswered; following an
incorrect response, the examinee receives the next less difficult item
previously unanswered.

How the flexilevel test adapts item difficulties to individual differences
in ability level can be seen by an examination of the examples shown
in Figure 2. Figure 2a illustrates the path through a flexilevel test for
an examinee of relatively high ability. All testees begin with item 1, an
item of median difficulty. Each correct answer leads to an item of higher
difficulty; thus, correct answers to items 1, 3, 5, 7, 9 and 11 led to the
administration of progressively more difficult items, moving from an item
at p=.50 to one at p=.20. Item 13 was answered incorrectly, and the next
less difficult item not already administered was item 2, with difficulty
p=.55. Item 2 was answered correctly, and the next more difficult item not
already administered was item 15, with difficulty p=.15. Following an
incorrect response to this item, item 4, with difficulty p=.60, was
administered. Thus, this examinee received ten items in the difficulty
range of p=.60 to p=.15.

Figure 2b shows how an examinee of average ability might move through
the item structure, alternating between successively more difficult and
successively less difficult items. Since this examinee is of average ability,
the odd-numbered items (except for item 1) are too difficult for him, and
he answers them incorrectly; the even-numbered items are too easy for him,
and he answers them correctly. Thus, an examinee of average ability might
be administered ten items in the difficulty range of p=.70 to p=.25.

Finally, Figure 2c illustrates the path that might be taken by an examinee
of relatively low ability. Incorrect answers to items 1, 2, 4, 6, 8 and 10
lead to the administration of progressively less difficult items, culminating
in the administration of item 12, with difficulty p=.80. Then, alternating
correct and incorrect answers lead to the administration of items at
difficulties p=.45, p=.85 and p=.40. Thus, this examinee received ten items
in the difficulty range of p=.85 to p=.40.
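
The branching rule and the three example paths described above can be
checked with a short program. The following Python sketch is illustrative
only and is not part of the original study. The function names and the
answer callback are hypothetical, and the .05 spacing between adjacent item
difficulties is an assumption inferred from the difficulties quoted in the
examples (for instance, item 2 at p=.55 and item 13 at p=.20).

```python
def flexilevel_path(n_items, is_correct):
    """Administer n_items by the flexilevel rule; return (items, responses).

    Items are numbered as in the text: item 1 is the median item, odd
    items 3, 5, ... are progressively harder, and even items 2, 4, ...
    are progressively easier. is_correct is a hypothetical callable that
    maps an item number to True (correct) or False (incorrect).
    """
    next_harder, next_easier = 3, 2   # lowest unanswered item on each side
    item = 1                          # every examinee starts at the median
    items, responses = [], []
    for _ in range(n_items):
        correct = bool(is_correct(item))
        items.append(item)
        responses.append(correct)
        if correct:
            # correct: move to the next more difficult unanswered item
            item, next_harder = next_harder, next_harder + 2
        else:
            # incorrect: move to the next less difficult unanswered item
            item, next_easier = next_easier, next_easier + 2
    return items, responses


def difficulty(item, step=0.05):
    """Proportion-correct difficulty of an item (assumed spacing of .05)."""
    offset = (item // 2) * step
    return round(0.50 - offset if item % 2 else 0.50 + offset, 2)
```

For the high ability examinee, who misses only items 13 and 15, calling
flexilevel_path(10, lambda i: i not in (13, 15)) yields the item sequence
1, 3, 5, 7, 9, 11, 13, 2, 15, 4, and applying difficulty() to those items
gives a range of p=.60 to p=.15, in agreement with the description above.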

The flexilevel test can be scored by counting the number of correct
responses. Lord (1971b) shows that the greater the number correct, the more
difficult was the subset of items answered and, therefore, the higher is the
ability level of that examinee. However, Lord also shows that examinees
with the same total number correct may be further differentiated according
to whether the last item was answered correctly or incorrectly; those who
answered the last item incorrectly have answered a more difficult subset of
items and have higher ability than those with the same total number correct
who responded correctly to the last item administered. Accordingly, Lord
proposes that an additional half-point be added to the number-correct scores
of examinees responding incorrectly to the last item administered.
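
The scoring rule just described can be written in a few lines. This Python
sketch is offered only as an illustration; the function name and the
representation of a response record as a list of True/False values in order
of administration are assumptions, not part of Lord's presentation.

```python
def flexilevel_score(responses):
    """Number correct, plus one half-point if the last item was missed.

    responses is a sequence of booleans in order of administration
    (True = correct). This record format is an illustrative assumption.
    """
    if not responses:
        raise ValueError("no responses to score")
    score = float(sum(1 for r in responses if r))
    if not responses[-1]:
        score += 0.5   # half-point bonus for a missed final item
    return score
```

Under this rule, two examinees who each answer six of ten items correctly
receive scores of 6.0 and 6.5, the higher score going to the examinee whose
final item was answered incorrectly.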
In summary, the flexilevel test adapts item difficulties to the ability
level of the examinee being tested using a branching procedure which selects
from the 2N-1 items available a subset of N items to be administered to
that examinee. The N items administered are those whose difficulties are
nearest to the examinee's ability level. Because of this adaptive property,
the flexilevel test should have several advantages in comparison to
conventional ability testing procedures.

First, since examinees will receive fewer items that are much too
difficult or much too easy for them, and thus fewer items that are
inappropriate for the accurate measurement of their abilities, it is
possible that the flexilevel test will yield ability estimates as reliable
and valid as those of conventional tests utilizing considerably more items.
Stanley (1971) suggests that the effective length of a
conventionally-administered test is considerably less than the total number
of items administered; it is the purpose of an adaptive test to select for
administration those items that are effective for measuring the ability of
a given examinee.

Second, the flexilevel test should provide ability estimates whose
reliability and validity are more nearly equivalent for examinees of
different ability levels. Several reports (Baker, 1964; Levine & Lord,
1959; Lord, 1957, 1959) have concluded that the precision or reliability of
measurement for a given individual is partly dependent on his/her "true
score." Thorndike (1951) and Davis (1952), among others, have shown that
the standard error of measurement will be minimum for examinees whose
ability levels correspond to that point on the ability/item difficulty scale
where the item difficulties in the test are concentrated. On the
conventional "peaked" ability test, with item difficulties concentrated
around p=.50, the error of measurement should be minimum for examinees of
average ability and will increase for individuals whose ability levels
deviate from the average. Thus, ability estimates for high and low ability
examinees will be less reliable than those for average ability examinees.
Further differential error in test scores is contributed by differences in
the amount of guessing on multiple-choice tests. While guessing reduces the
reliability and validity of measurement for all subjects (e.g., Ebel, 1969;
Frary & Zimmerman, 1970; Lord, 1957), the increase in error is greatest for
low ability subjects. According to Nunnally (1967), on a conventional
test where all items are attempted, low ability subjects will guess the
most because they know the least. Thus, the flexilevel test, where item
difficulties are concentrated around the ability level of each examinee,
should yield ability estimates which will tend to be equally reliable across
the ability continuum.

Research on Flexilevel Tests

Research to date on flexilevel testing includes one theoretical study
(Lord, 1971d), one real-data simulation study (Kocher, 1974) and one
live-testing study (Olivier, 1974).

Theoretical study. Lord's (1971d) study comparing the measurement
effectiveness of flexilevel and conventional tests was based on the
assumptions and mathematics of item characteristic curve theory (Lord &
Novick, 1968). The flexilevel tests studied were composed of 60 items (thus
requiring a total item structure of 119 items), all having the same
discriminating power (normal ogive parameter a equal to .50) and having
difficulties distributed along the ability continuum such that the distance
between successive item difficulties was a constant. The tests were scored
using number-correct plus an additional half-point for item response
patterns having the last item incorrect. The conventional or "standard"
tests used for comparative purposes were composed of 60 equally-discriminating
items (a=.50). In one of these tests, all items were of median (normal ogive
parameter b=0.0) difficulty. The other two conventional tests were intended
to measure most effectively or be most highly discriminating at two points
on the ability continuum. For maximally effective or discriminating
measurement at θ=±2, one conventional test had 30 items at b=2 and 30 at
b=-2. For maximally effective measurement at θ=±3, the other conventional
test had 30 items at b=2.8 and 30 at b=-2.8.

Flexilevel and conventional tests were compared in terms of information
functions, which indicate the relative accuracy of measurement across the
ability continuum. The value of the information function at a given level
of ability indicates how well the test scores obtained by individuals of
that ability accurately reflect their "true" ability. The greater the value
of information at a given level of ability, the more accurate is the
measurement or, in other words, the smaller is the confidence interval for
estimating true ability from test scores.

Information values are not meaningful in any absolute sense because
they are dependent on the scale used to measure ability (θ), but information
values calculated from two or more testing strategies assuming the same
θ scale can be directly compared, with larger values indicating more
accurate measurement. Further, the ratio between the two tests' information
values at a given level of ability can be interpreted in terms of the
relative numbers of items required to provide equal accuracy of measurement
for individuals at that ability level. For example, if, for a given θ
level, the information value of one test is twice that of a second test, the
first test provides as much information as the second test while requiring
half the number of items.
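
The arithmetic behind this interpretation can be made explicit. The Python
sketch below is illustrative only; the function name is hypothetical, and
it assumes that a test's information at a given θ grows in direct
proportion to the number of items when the test is lengthened with
comparable items.

```python
def equivalent_length(n_items, info_this, info_other):
    """Length the other test would need to match this test's information.

    n_items    : number of items in this test
    info_this  : information value of this test at a given ability level
    info_other : information value of the other test, at its current
                 length of n_items, at the same ability level
    Assumes information is proportional to test length.
    """
    if n_items <= 0 or info_this <= 0 or info_other <= 0:
        raise ValueError("test length and information values must be positive")
    return n_items * info_this / info_other
```

With an information ratio of 2 to 1 at some θ level, a 60-item test is
matched only by a 120-item version of the other test, which is the same
statement as saying that the first test needs half as many items.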

Lord (1971d) found that the flexilevel tests provided more information
throughout the ability range than did the conventional tests designed to
discriminate at two points (θ=±2 or θ=±3) on the ability continuum. The
conventional test peaked at the median ability level (b=0.0) provided
more information than did the flexilevel tests at ability levels around the
median, but as ability level deviated from the average, the flexilevel
tests provided increasingly more information than did the conventional test.
For example, the 60-item flexilevel test in which d, the distance between
successive item difficulties, was equal to .033/2a (equal to .033 since a
was equal to .50) measured as accurately as a 58-item conventional test at
θ=0, a 60-item conventional test at θ=±1, a 69-item conventional test at
θ=±2 and an 86-item conventional test at θ=±3. Thus, for any examinee with
an ability level outside the range of θ=±1, the flexilevel test provided more
accurate measurement. These results were obtained under the assumption of
no guessing.

The results obtained when the guessing parameter "c" was set at .2
were similar to those obtained when no guessing was assumed, except that
the superiority of the flexilevel test outside of the range θ=±1 was more
pronounced for the low ability levels. For example, the flexilevel test
measured as accurately as an 83-item conventional test at θ=+3 but as
accurately as a 114-item conventional test at θ=-3. Thus, these data
indicate that the advantage of flexilevel tests at low ability levels is
significantly greater when correct responses are likely as the result of
guessing.

Lord's finding that the peaked conventional test provided more
information for individuals of near-average ability, while the flexilevel
tests provided more information for individuals whose ability levels deviated
appreciably from the mean, is in agreement with his other theoretical
studies comparing adaptive and conventional tests (Lord, 1970, 1971a,e).
In general, the comparative efficiency or precision of measurement of
adaptive versus conventional testing strategies as studied theoretically
is summarized graphically in Figure 3. Figure 3 illustrates that while the
conventional peaked test does provide superior measurement around the mean
ability level, the accuracy of measurement of the adaptive tests is more
constant across all levels of ability and exceeds that of peaked tests beyond
a point above and below the mean ability level. The importance of these
findings is that they indicate that an individual will be more accurately
measured as the items administered to him/her are more appropriate (i.e.,
nearer in difficulty level) to his/her level of ability.

Figure 3

A HYPOTHETICAL ILLUSTRATION OF THE
COMPARATIVE MEASUREMENT EFFICIENCY
(PRECISION OR INFORMATION) OF
CONVENTIONAL PEAKED AND ADAPTIVE TESTS

[Figure not reproduced. Vertical axis: measurement efficiency. Horizontal
axis: ability in z-score units, from -3.0 to +3.0, with 0.0 marked as
average. Curves shown: Adaptive Test 1, Adaptive Test 2, and Peaked Test.]

However, Lord's results concerning the comparative accuracy of
measurement of flexilevel (and other adaptive) tests and conventional tests
are limited by the assumption of items with equal discriminating power and
having difficulty levels equal to theoretical specifications. It is
uncertain whether such results can be generalized to situations in which
tests must be constructed using item pools containing finite numbers of
items having parameters that can only be estimated and which do not
necessarily correspond to ideal specifications. For example, in a simulation
study of two-stage adaptive testing procedures using real item parameters
(Betz & Weiss, 1974), it was found that one two-stage test provided more
information than did a conventional test at all ability levels, including
the mean. Although in that study the average discriminating power of the
items in the two-stage test was slightly greater than that of the
conventional test items, these findings do suggest some skepticism regarding
the generalizability of results obtained under the assumption of
theoretically ideal items.

Real-data simulation study. In the study by Kocher (1974), responses
to conventional test items were scored as if the tests had been administered
using the structure and branching rules of the flexilevel strategy.
The study used data from five previously administered conventional tests.
Three of these tests, consisting of 42, 36, and 36 items respectively, were
classroom examinations administered to 180 college students enrolled in a
junior-level course in introductory educational measurements. Pearson
product-moment correlation coefficients were calculated between scores on
the 21, 18 and 18-item flexilevel tests and scores on the appropriate parent
tests. In addition, the correlation between the sum of the standard scores
on all three flexilevel tests and the sum of the standard scores on all three
parent tests was computed.

The last two conventional tests were semester final examinations in a
high school geometry course. The first group, consisting of 412 students,
had been administered a 100-item examination. The second group, consisting
of 485 students, had been administered a 70-item examination. Again,
correlations between the simulated flexilevel scores and scores on the
appropriate parent test were calculated.

Results indicated that the correlations between simulated flexilevel
scores and scores on the parent tests ranged from .90 to .96; the correlation
between the two sets of summated scores obtained in the college group was .96.
The size of these correlations, which Kocher interpreted as parallel-forms
reliability coefficients, was taken to indicate that flexilevel scores could
be validly substituted for conventional test scores and have the advantage
of using fewer items.

However, interpreting these correlations as parallel-forms reliability
coefficients is not valid because the flexilevel items were a subset of the
items in each parent test. The item overlap between the two tests would
suggest that the obtained correlation coefficients are artifactually high.
In addition, the results of the study are limited by the fact that the
flexilevel tests were not actually administered to the examinees. Thus, there
was no allowance for the possible psychological effects on an examinee of
taking a test in which item difficulty is at least somewhat adapted to
his/her ability level.

Live-testing study. A study which employed paper and pencil
administration of a flexilevel test was reported by Olivier (1974). In this
study, eighth-grade students were first administered the Florida Eighth Grade
Test Battery. Approximately one month later, they were administered either a
40-item conventional test or a 20-item flexilevel test. The 39 items needed
for the total flexilevel structure were the same items as were used in the
40-item conventional test; these items were taken from the reading vocabulary
subtest administered initially as part of the Eighth Grade Test Battery.
In order to compare the flexilevel test to a conventional test with the same
number of items, three 20-item conventional subtests were extracted from
the total 40-item conventional test. The three 20-item tests were constructed
by 1) randomly selecting 20 of the 40 items; 2) selecting the even-numbered
items; and 3) selecting the 20 items with difficulty values closest to p=.67
(considered the optimal level of difficulty for the group when items were
four-alternative multiple choice).

Results showed, first, that the flexilevel test was less internally
consistent and, therefore, had a larger standard error of measurement than
any of the conventional tests. Second, the flexilevel test showed a lower
correlation with an external criterion than did the conventional tests.
Third, and the only result favorable to the flexilevel strategy, it was
found that item difficulties calculated from the flexilevel administration
were closer to p=.67 and had a smaller standard deviation than the item
difficulties as calculated on the normative sample. This result indicates
that item difficulties were more appropriate for the individuals to whom
the items were administered.

However, this study contained several methodological errors which
severely limit the fairness of the comparison between flexilevel and
conventional testing procedures. First, a one-factor random effects analysis
of variance model (Stanley, 1971, pp. 425-428) was used to estimate the
internal consistency reliability of the flexilevel test. However, nearly
all of the assumptions of this model were violated in the study, namely an
infinitely large item pool from which items are randomly selected for
administration to each subject, random assignment of subjects to treatments,
and a probability approaching zero that two examinees will attempt the same
item. Olivier justifies the violations on the basis of a lack of an
alternative method for computing internal consistency reliabilities.

Olivier claimed that the adequacy of the method of reliability
estimation was indirectly supported by the fact that the correlation between
the flexilevel test and the criterion was lower than that between the
conventional tests and the criterion; presumably the lower correlation was
due to the attenuation caused by the lower reliability rather than to a
lesser proportion of shared variance. However, the criterion itself was
questionable on two bases. It consisted of the combined score from six other
subtests in the test battery: 1) reading comprehension; 2) reading essential
skills; 3) study skills; 4) occupational information; 5) mathematics problem
solving; and 6) mathematics essential skills. Only the two reading tests
would appear to have any relevance as criteria for the adequacy of a
vocabulary test. And regardless of the content of the criterion test, it is
questionable whether another conventionally-administered test should be the
only standard for evaluating the relative efficiency of conventional versus
adaptive testing procedures, since the higher correlation between the two
conventional tests could be due to method variance.

Finally, paper and pencil administration of the flexilevel test was
found to present several serious difficulties which reduced the accuracy of
the test data collected. First, over 10% of the flexilevel protocols had
to be discarded because the examinees made errors in following the branching
instructions. Second, another 10% of the answer sheets were found to have
faulty ink, thus causing many examinees to misroute themselves even
though they were following the directions properly. These latter protocols
were retained in the analysis with unknown effects on the results. Third,
in order to follow the branching rules, examinees knew whether they had
answered each item correctly or incorrectly; it is possible that such
immediate feedback may have aroused anxiety in examinees given the flexilevel
test that was not aroused in examinees administered the conventional test.

The lower reliability of the flexilevel test found in Olivier's study
may be related to one potentially disadvantageous characteristic of the test:
while the flexilevel test does identify a region of the item pool of
approximately appropriate difficulty for each examinee, after the maximally
appropriate difficulty level is reached the remaining items administered
tend to be increasingly divergent from the examinee's ability level.
Reference to Figure 2 provides an illustration of this characteristic. For
the high ability examinee in Figure 2a, the most appropriate level of item
difficulty probably lies between p=.25 and p=.20. The first seven items
administered converge on this level of difficulty, but the items administered
following an incorrect response to item 13 are increasingly divergent, having
difficulties of p=.55, p=.15 and p=.60. For the average ability examinee in
Figure 2b, for whom the median difficulty level is most appropriate, the
items administered become progressively less appropriate to his/her ability
level.

The net effect of this divergence characteristic is that as the
flexilevel test proceeds through successive stages, the testee is administered
a series of items which tend to alternate between items that are much too
easy and items that are much too difficult. Reliability may be reduced by
increased amounts of guessing toward the end of the test and by the
possibility that such divergence, if perceived by the examinee, may have
adverse psychological effects (see Weiss, 1974, p. 43).

Other research. A final study of flexilevel testing is currently in
its preliminary stages. The objective of this study, as reported by Hansen,
Johnson, Fagan, Tam and Dick (1974), is to explore the utility of adaptive
testing procedures within the context of a computer-managed instructional
system in an Air Force technical training environment. After extensive
review of the literature on adaptive testing strategies, it was decided that
the flexilevel model offered excellent potential for a 40-50% reduction in
testing time along with either an increase in measurement accuracy or no
decrease. In Phase I of this study (Hansen et al., 1974), a
computer-administered flexilevel testing system was implemented. This
flexilevel strategy differed from those used in previous studies in that
examinees were individually entered into the flexilevel item structure based
on estimates of their predicted performance derived from prior ability and
performance data. After the administration of the flexilevel test, the
remaining items in the structure were administered, yielding a total
"conventional test" score. Thus, both flexilevel and conventional test scores
were available for each individual. Although empirical data from this study
have not yet been reported, Hansen et al. state that preliminary results
support the feasibility and ease of implementing the flexilevel procedure
and the capacity of the flexilevel testing strategy to offer considerable
savings in testing time. However, firm conclusions regarding the results
of this study must await the appearance of the results of empirical data
analysis.

Summary. The studies to date of flexilevel testing have indicated
that it can provide more accurate measurement than conventional tests for
examinees whose ability levels differ from the average ability level of the
group being tested, that scores on simulated flexilevel tests correlate
highly with scores on the parent tests from which the former scores derive,
and that the flexilevel test does increase the appropriateness of item
difficulties for examinees' ability levels. On the other hand, results
also indicate that the flexilevel test had lower internal consistency
reliability and lower criterion-related validity than did conventional tests
used for comparative purposes.

This conflicting series of results may be explained in part by the
nature of the studies done. First, each study used a different research
method; theoretical, real-data simulation, and actual test administration
studies provide different kinds of information, and each type of study is
subject to unique limitations. Second, the studies were all limited in the
range of evaluative criteria used; only Olivier's (1974) study used more
than one criterion of evaluation. Thus, there is little opportunity to
compare results pertaining to one criterion of evaluation across two or
more studies.

Objectives 

Flexilevel testing strategies have not yet been evaluated in terms of
such psychometric properties as the characteristics of the score distributions
they yield, test-retest stability, parallel-forms reliability, correlations
with direct criteria of ability, or precision of measurement when real item
pools are used. The present series of studies was designed both to increase
the extent and variety of information relevant to the comparison of
flexilevel and conventional testing procedures and to attempt to clarify the
interpretive difficulties raised by the results of the previous three studies.



To achieve these purposes, two related types of studies were done.
First, flexilevel and conventional tests were computer-administered to
college students. In view of the difficulties of paper and pencil
administration found by Olivier (1974), computer-administration was felt to
be better able to provide examinee response records containing no errors in
branching and to eliminate the loss of records through such errors.
Furthermore, since the computer can select the next item to be administered
without the testee's knowledge of whether each item was answered correctly
or incorrectly, computer administration might reduce somewhat the possible
adverse psychological effects. The second study involved Monte Carlo
simulation of examinee response records for the same flexilevel and
conventional tests used in the computerized administration.

METHOD

Design 

The empirical study, involving the actual computer-administration of
flexilevel and conventional tests, was designed to permit the investigation
of 1) the characteristics of the score distributions yielded by flexilevel
and conventional tests; 2) the relationship between ability estimates
yielded by the flexilevel and conventional tests; and 3) the test-retest
stability of flexilevel and conventional test scores.

Because the generalizability of results yielded by an empirical study
is frequently limited by the sample size and by the characteristics of the
subjects tested, the procedures followed in the empirical study were also
followed in a Monte Carlo simulation study. Monte Carlo simulation involves
the generation of hypothetical groups of subjects and the use of either
hypothetical or real item pools. The ability levels of the subjects and the
item parameters are specified in advance. Then, using item characteristic
curve theory and computer-generated random numbers, vectors of item responses
are generated for a specified number of subjects. A study of this type
provides no information on the psychological effects of testing on examinees
and is limited by the assumptions used in generating response records for
hypothetical testees, but it does provide large sample sizes and precise
control of the characteristics of the population studied.
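
The generation of response vectors described above can be sketched in a few
lines of Python. This is a minimal illustration under stated assumptions,
not the program used in the study: it assumes a normal ogive item
characteristic curve with discrimination a, difficulty b and guessing
parameter c, and the function names, the fixed seed and the tuple layout of
the item parameters are all hypothetical.

```python
import random
from statistics import NormalDist

_PHI = NormalDist().cdf   # standard normal cumulative distribution function


def p_correct(theta, a, b, c=0.0):
    """Probability of a correct response under an assumed normal ogive model."""
    return c + (1.0 - c) * _PHI(a * (theta - b))


def simulate_responses(thetas, items, seed=0):
    """Generate one vector of item responses per simulated subject.

    thetas : ability levels specified in advance, one per subject
    items  : (a, b, c) parameter tuples specified in advance, one per item
    seed   : seed for the random number generator (illustrative default)
    Returns a list of lists of booleans (True = correct response).
    """
    rng = random.Random(seed)
    return [
        # a response is correct when a uniform draw falls below P(correct)
        [rng.random() < p_correct(theta, a, b, c) for (a, b, c) in items]
        for theta in thetas
    ]
```

Because ability levels are fixed before any responses are drawn, each
simulated score can later be compared against the known "true" ability,
which is not possible with live examinees.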

Thus, the Monte Carlo simulation study was designed to replicate the
procedure followed in the empirical study and also to provide evaluative
information beyond that provided by the empirical study. Paralleling the
live-testing study, the simulation study provided information concerning
score distributions and the relationship between scores on the flexilevel
and conventional tests. Simulated re-administration of the same test, which
under the conditions of empirical test administration provided test-retest
stability data, provided data concerning the parallel-forms reliability of
flexilevel and conventional tests. The availability of an ability criterion
(i.e., knowledge of "true" scores) permitted the investigation of the
relationships between ability estimates and underlying ability. Finally,
since the items used in the simulation study were specified to have
parameters identical to those items used in the empirical study, it was
possible to replicate Lord's (1971d) study of the amount of information or
precision of measurement of each testing strategy using real, "non-ideal"
items.

In summary, comparison of results of the two studies was considered
to permit greater generality of conclusions than would be possible using
only one method. Further, it was hoped that by following similar procedures
in two different kinds of studies, sources of method variance leading to
different conclusions could be identified.

Test Construction

Item Pool

The item pool used to construct the flexilevel and conventional tests
consisted of five-alternative multiple-choice vocabulary items. The items
were normed on college students, and normal ogive difficulty (b) and
discrimination (a) parameters were available for each item. Details
concerning the development and norming of the item pool are reported by
McBride and Weiss (1974). One characteristic of this item pool relevant to
the evaluation of the flexilevel test was that there were many highly
discriminating items of below average difficulty but considerably fewer
highly discriminating and difficult items. Thus, in the item selection
process, the more difficult items selected tended to be less discriminating
than the less difficult items selected.

Flexilevel Test

Item structure. The flexilevel test constructed was one in which each
examinee would attempt 40 items; thus, the total item structure required 79
items. These items were selected to be distributed along the difficulty
continuum in the range of b=-3.0 to b=+3.0. Following Lord's (1971d)
procedure, it was desired that the distance, d, between successive item
difficulties be equal to a constant. Thus, the total range of difficulties
divided by the number of intervals between 79 items (78) led to a desired
value of d equal to .075. Of the available pool of items, only those with
discrimination values greater than a=.30 were considered for inclusion in
the flexilevel structure. The criterion for a constant distance between
successive item difficulties was followed as closely as possible given the
constraints of a real item pool and the minimum discrimination value
required.

The mean difficulty of the 79 items in the flexilevel item structure
was b=-.01; the mean discrimination value was a=.65, substantially greater
than the minimum acceptable level. Table A-1 in the Appendix contains item
reference numbers, item serial numbers, and difficulty and discrimination
values for each item in the flexilevel structure. The item serial numbers,
from 1 to 79, follow the rank order of item difficulties, from the least
difficult, b=-3.11, to the most difficult, b=2.95, and are useful in
determining the order in which items would be administered. Thus, the first
item administered was always number 40 (b=0.0); under the flexilevel
branching rule a correct response would lead to the item whose serial number
was the next larger one not previously administered, and an incorrect
response would lead to the item whose serial number was the next smaller one
not previously administered.
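
The branching rule just described can be sketched in a few lines. The
following is a minimal illustration, not the program used in the study; the
function name `flexilevel_sequence` and its arguments are hypothetical, and
the sketch assumes only what the text states: 79 serial-numbered items, a
start at item 40, and movement to the next larger or next smaller unused
serial number.

```python
def flexilevel_sequence(responses, n_items=79):
    """Return the serial numbers administered for a list of responses.

    responses: iterable of booleans (True = correct), one per item taken.
    Serial numbers run from 1 (least difficult) to n_items (most difficult).
    """
    middle = (n_items + 1) // 2      # item 40 when n_items is 79
    next_harder = middle + 1         # next larger serial number not yet used
    next_easier = middle - 1         # next smaller serial number not yet used
    current = middle
    administered = []
    for correct in responses:
        administered.append(current)
        if correct:
            # Correct response: branch to the next larger unused item.
            current = next_harder
            next_harder += 1
        else:
            # Incorrect response: branch to the next smaller unused item.
            current = next_easier
            next_easier -= 1
    return administered


# An examinee alternating right and wrong answers on the first four items.
print(flexilevel_sequence([True, False, True, False]))
```

Under this sketch an examinee answering all 40 items correctly sees items 40
through 79, and one answering all 40 incorrectly sees items 40 down to 1,
which is why a 40-item test requires a 79-item structure.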

It may be noted that increasing item serial numbers do not always
correspond to increases in b values (e.g., serial numbers 10 and 11).
This flexilevel test was constructed before the conclusion of the item
norming studies which led to the publication of the characteristics of
the final item pool (McBride & Weiss, 1974). Some of the item parameter
estimates used in constructing the test were based on smaller sample sizes
than those which characterized the final pool. Further norming studies led
to some small changes in the "b" and "a" values characterizing certain
items, and these changes did in some cases reverse the rank order of item
difficulties. Fortunately, the changes were slight, and should not
appreciably affect the adaptive property of the flexilevel test. The item
parameters presented in Table A-1 are the final parameters, as reported in
McBride & Weiss (1974).

Although the mean discrimination value of all 79 flexilevel test
items was .65, the mean discrimination of the 40 items taken by any given
examinee was a function of that examinee's ability, because of the
relationship between item difficulty and item discriminating power. For
example, an examinee who obtained 0 correct would have been administered
items with mean a=.75, whereas an examinee obtaining 40 correct would have
encountered items having a mean "a" value of .54. The mean "a" values
corresponding to 10, 20, and 30 correct would be .74, .69 and .62,
respectively. Thus, high ability examinees would be administered a less
discriminating series of items than would low ability examinees.

Scoring. In the empirical study, the flexilevel test was scored using
1) simple number correct, and 2) Lord's (1971b) suggested modification in
which an extra half-point is added to the score of each examinee responding
incorrectly to the last item administered. This latter score was doubled,
following Lord's suggestion, to eliminate the fractional values. Thus, the
number-correct score, which will be referred to as Score 1, could range
from 0 to 40. The half-point score, which will be referred to as Score 2,
could range from 1 (the individual receiving 1/2 point, multiplied by 2,
for an incorrect response to the final item) to 80 (all 40 items answered
correctly).
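
The two scoring rules can be illustrated as follows. This is a small sketch
under the definitions given above; the function name `flexilevel_scores` is
hypothetical and is not taken from the original scoring program.

```python
def flexilevel_scores(responses):
    """Return (Score 1, Score 2) for one examinee's item responses.

    responses: sequence of booleans (True = correct) in order of
    administration; the last element is the final item taken.
    """
    responses = list(responses)
    score1 = sum(1 for r in responses if r)            # simple number correct
    half_point = 0.5 if not responses[-1] else 0.0     # bonus if last item missed
    score2 = int(2 * (score1 + half_point))            # doubled to remove fractions
    return score1, score2


print(flexilevel_scores([False] * 40))   # lowest possible scores
print(flexilevel_scores([True] * 40))    # highest possible scores
```

The two calls reproduce the end points stated in the text: Score 1 runs from
0 to 40 and Score 2 from 1 to 80.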

In the simulation study, only Score 2 was calculated; this score uses
more information than simple number correct and was also the scoring method
used by Lord (1971d) in his theoretical studies. In addition, preliminary
results from the empirical study suggested that the two scoring methods
yielded essentially equivalent results.

Conventional Test 

The conventional test, also consisting of 40 items, was the same test
that was compared to two-stage testing procedures in studies by Betz & Weiss
(1973, 1974). Item difficulties were concentrated around a "b" value of
-.33 (somewhat easier than the median ability level of the group since
guessing was a possibility). Again, a minimum a value of .30 was required;
the resulting 40 items had a mean a of .54. Table A-2 in the Appendix
provides the b and a values corresponding to each of the 40 items in the
conventional test as reported by McBride & Weiss (1974). The test was
scored using number correct, which ranged from 0 to 40.

Empirical Study

Administration and Subjects

Tests were administered to undergraduate and graduate students taking
introductory psychology, introduction to statistics, and theory of
measurement courses at the University of Minnesota in the fall of 1972.
Students were tested at individual cathode-ray terminals (CRT's) connected
by acoustical couplers to the University's CDC 6400 time-shared computer
system (see DeWitt & Weiss, 1974, for details of the computer software
system). Items were presented on the CRT screen, and testees indicated
their response by typing in the number of the chosen alternative for each
multiple-choice item. Following their response, the next item appeared on
the screen. Instructional screens explaining the operation of the CRT's
were provided prior to testing, and a proctor was present in the testing
room to provide assistance to any testee having difficulty with the
equipment.

Testees were permitted as much time as necessary to complete the
tests and were so informed before test administration was begun. Testees
received no feedback during the course of testing; at the end of the
testing session they were told how many items they answered correctly.

Several subject groups were utilized in this study; these groups are
summarized in Table 1. Subjects were administered two tests on each of
two occasions. Reference to Table 1 shows that 477 subjects were tested
on the Time 1 administration. Group 1 consisted of 107 students from the
introductory psychology course; these students were administered the
flexilevel and conventional tests. Group 2a consisted of 107 students from
the introduction to statistics course; these students received a flexilevel
test and a two-stage (see Weiss, 1974, pp. 3-11) test. Group 2b consisted
of 110 students, also from the introduction to statistics course, who
received the conventional and the two-stage test. Group 3 consisted of 153
graduate and undergraduate students from the theory of measurement course;
these students received the flexilevel test and a series of difficult
vocabulary items for use in continued norming of the vocabulary item pool.

                                 Table 1

          Summary of Data Collection in the Empirical Study of
                           Flexilevel Testing

                        Time 1                      Time 2
                   ----------------------     ----------------------
Group              Tests Administered    N    Tests Administered    N

1 (Introductory    Flexilevel and       107
  Psychology)      Conventional

2a (Introductory   Flexilevel and       107   Flexilevel and        94
  Statistics)      Two-stage                  Two-stage

2b (Introductory   Two-stage and        110   Two-stage and         85*
  Statistics)      Conventional               Conventional

3 (Theory of       Flexilevel and       153   Flexilevel and       131
  Measurement)     Vocabulary norming         Numeric norming

Total                                   477                        310**

 *Resulted in 74 usable conventional test-retest records
**Included 196 usable flexilevel test-retest records

Students from Group 2a were retested on the flexilevel and two-stage
tests after an average interval of about five and one-half weeks. The
students in Group 2b were retested on the conventional and two-stage tests,
and those in Group 3 received a flexilevel retest and a series of number
series items as part of the norming of an item pool to measure numeric
problem-solving abilities. Students in Group 1 were not retested.

As Table 1 shows, the Group 1 data permitted the analysis of the
relationship between scores obtained from flexilevel and conventional
tests. Data from Groups 2a and 3 permitted the analysis of the
test-retest stability of flexilevel test scores, and data from Group 2b
permitted analysis of the stability of conventional test scores.

Table 1 indicates that retest records were not available for all of
the students tested on the first occasion. Also, of the 225 students
retested on the flexilevel test, only 196 of the test-retest records were
usable, and of the 85 students retested on the conventional test, only 74
test-retest records were usable. This loss of examinee records was largely
due to the failure of subjects to report for the retest. Computer failures
during testing also contributed to incomplete and therefore unusable test
records from both the Time 1 and Time 2 administrations.

Order effects. Since each student was administered two tests on each
occasion, the possible effect of order of administration of the tests on
obtained scores was a variable of interest, as it was in previous studies
(e.g., Betz & Weiss, 1973; Larkin & Weiss, 1974; Larkin & Weiss, 1975).
To study this variable, the order of administration in Groups 1 and 2 was
randomized so that approximately half of each group would receive the
flexilevel test first (order 1), and the other half of the group would
receive the flexilevel test second (order 2). The differences between mean
scores from order 1 and order 2 were examined using t-tests for the
significance of the difference between independent means.
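
As an illustration of the comparison just described, the sketch below
computes an independent-groups t statistic from summary statistics. It
assumes the pooled-variance form of the test and standard deviations
computed with N-1 in the denominator; the report does not state which
variant was used, and the function name `pooled_t` is hypothetical. The
example values are the Group 1, Score 1 figures reported in Table 2.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se_diff, df


# Group 1, Score 1: flexilevel first (order 1) vs. flexilevel second (order 2).
t, df = pooled_t(54, 19.37, 6.09, 53, 19.34, 4.99)
print(round(t, 2), df)
```

With these inputs the statistic rounds to the t of .03 on 105 degrees of
freedom shown in Table 2, which suggests, but does not prove, that a pooled
test of this kind was used.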

Simulation Study 

The Simulation Model 

The Monte Carlo simulation procedure was initially developed for use
in simulation studies of two-stage ability testing (Betz & Weiss, 1974); the
procedure is described in detail in that report. The procedure was based
on the assumptions and mathematics of item characteristic curve theory
(Lord & Novick, 1968). The basic assumption made was that the probability
of a correct response to an item is a generalized normal ogive function of
an examinee's ability. To determine the probability of a correct response
to an item given a specified ability level, the ability level and the normal
ogive difficulty, discrimination, and guessing parameters corresponding
to that item were entered into the equation suggested by Birnbaum (1968,
Equation 17.3) and used by Lord (1971d,e) in his theoretical studies of
flexilevel and two-stage testing.

The use of this simulation procedure in the study of two-stage testing
(Betz & Weiss, 1974) yielded results which did not contradict and in most
cases supported results obtained from a parallel empirical study (Betz &
Weiss, 1973). Thus, it was considered to have utility for use in the
present study.

Procedure

The computer program which "administered" the tests and calculated
test scores in this study was a modification of the program used in the
two-stage simulation study (see Betz & Weiss, 1974). The modification
involved replacing the subroutine designed to administer a two-stage test
with one that administered a flexilevel test. Following the design of the
two-stage study, two administrations of the flexilevel test and two
administrations of the conventional test were simulated for two samples of
hypothetical testees.

One sample consisted of 10,000 testees sampled from a normally distributed
population; ability levels were assigned to testees using a pseudo-random
number generator which yielded a normally distributed set of numbers with
mean 0 and variance 1. The second sample consisted of 1,600 testees, 100
at each of 16 ability levels between θ=-3.2 and θ=3.2. The 16 ability
levels used are shown in Table 10. This latter distribution of ability
levels, the "equal-frequency" distribution, was generated to allow
calculation of values of the information function that were based on equal
sample sizes at each selected point on the ability continuum.

Once ability level had been specified, item "administration" was begun.
The parameters of the particular item to be administered were entered, along
with the ability level, into the equation used to calculate the probability
of a correct response to that item. Since the items were in a
five-alternative multiple-choice format, the guessing parameter assigned to
all items was .2, the probability of obtaining a correct response through
random guessing. Following the calculation of the probability of a correct
response, a random number was sampled from a rectangular distribution of
real numbers between 0 and 1. If the random number was less than the former
probability, the item was scored "1" (correct); if the random number was
greater than the probability, the item was scored "0" (incorrect). The
item response, 1 or 0, was then used in scoring the test and, in the
flexilevel test, was used to determine the next item to be administered
through the branching rules described previously.
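
The response-generation step described above might be sketched as follows.
The exact form of the equation is not reproduced in this section, so the
sketch assumes the usual normal ogive model with a guessing parameter,
P(θ) = c + (1 - c)Φ[a(θ - b)]; the function names are hypothetical, and the
original program was not written in Python.

```python
import math
import random


def normal_cdf(z):
    """Cumulative standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_correct(theta, a, b, c=0.2):
    """Assumed model: guessing floor c plus a normal ogive in a * (theta - b)."""
    return c + (1.0 - c) * normal_cdf(a * (theta - b))


def simulate_response(theta, a, b, c=0.2, rng=random):
    """Score 1 if a uniform random number falls below P(correct), else 0."""
    u = rng.random()                      # rectangular distribution on [0, 1)
    return 1 if u < p_correct(theta, a, b, c) else 0


# An examinee of average ability meeting an item of average difficulty.
print(p_correct(theta=0.0, a=0.65, b=0.0))
print(simulate_response(theta=0.0, a=0.65, b=0.0, rng=random.Random(1)))
```

Under this assumed model, an examinee whose ability equals the item
difficulty answers correctly with probability .6 rather than .5, because
the guessing parameter of .2 raises the lower asymptote of the curve.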

Data Analysis

The following data were available from the empirical study: 1)
conventional and flexilevel scores from Group 1; 2) test and retest score
distributions for the flexilevel test from Group 2a and for the conventional
test from Group 2b; and 3) flexilevel test and retest score distributions
from Group 3.

One set of data from the simulation study consisted of ability level,
scores from the two administrations of the flexilevel test, and scores from
the two administrations of the conventional test for each of 10,000 "testees"
whose ability levels were sampled from a normally distributed population.

The second set of simulation data consisted of ability level and the
scores obtained from the two administrations of the flexilevel and
conventional tests for 1,600 "testees," 100 at each of 16 ability levels.
These data were used only in the calculation of values of test information
functions at each of the 16 ability levels.

Characteristics of Ability and Test Score Distributions

While it was assumed that the 10,000 ability levels sampled from a
normally distributed population would be normally distributed, several
characteristics of the resulting distribution of ability levels were
examined to determine whether or not this assumption was reasonable. The
mean, variance, skewness, and kurtosis for the 10,000 ability levels were
calculated. These four statistics were then tested for the significance
of their departure from expectation under the normality assumption (McNemar,
1969, pp. 25-28 and 87-88).
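
A check of this kind can be sketched as follows. The sketch uses
moment-based skewness and kurtosis indices and the common large-sample
standard errors sqrt(6/N) and sqrt(24/N); these are assumptions made for
illustration, and the formulas given by McNemar (1969) may differ in detail.
The function name `moment_checks` is hypothetical.

```python
import math


def moment_checks(values):
    """Mean, variance, skewness, kurtosis and approximate z ratios."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0                       # zero for a normal curve
    return {
        "mean": mean,
        "variance": m2,
        "skewness": skew,
        "kurtosis": kurt,
        # Ratios to large-sample standard errors under N(0, 1) sampling.
        "z_mean": mean / math.sqrt(1.0 / n),
        "z_variance": (m2 - 1.0) / math.sqrt(2.0 / n),
        "z_skewness": skew / math.sqrt(6.0 / n),
        "z_kurtosis": kurt / math.sqrt(24.0 / n),
    }


# A small symmetric, flat set of values, used only to show the output.
print(moment_checks([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

In practice each ratio would be referred to the standard normal
distribution, with absolute values near or above 2 taken as a sign of
departure from the values expected under normality.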

Analyses of the characteristics of the empirical and simulated test
score distributions were done separately for each administration (test or
retest) of the test. In the empirical study, analyses were also done
separately for each subject group since the three groups were expected to
differ in mean ability level.

Again, the mean, variance, skewness and kurtosis were calculated for
each test score distribution; the indices of skewness and kurtosis were
tested for the significance of their departure from normality. The
flexilevel and conventional test score means within each group were compared
using t-tests for the significance of the difference between the means of
dependent groups (e.g., Glass & Stanley, 1970, pp. 297-300).

Reliability

Test-retest stability. Stability data for the flexilevel test were
available from Groups 2a and 3 and for the conventional test from Group 2b.

Pearson product-moment correlations were calculated for the
test-retest score distributions. To examine the effect of interval length
on stability, the total groups were divided into three subgroups according
to the length of the interval between test and retest. The three subgroups
were: 1) short interval (13-30 days); 2) moderate interval (31-46 days);
and 3) long interval (47-62 days).

These intervals, determined so that the three subgroups would be of
approximately equal size, were the same intervals used in the study of
pyramidal adaptive testing procedures (Larkin & Weiss, 1974).
Product-moment correlations were calculated between test and retest scores
within each subgroup.

In addition to the possibility that test-retest stability might be
affected by interval length, there was the possibility that the stability
of flexilevel and conventional tests might be differentially affected by
memory of particular items and of the previous responses to them. To the
extent that memory leads the examinee to repeat the same responses he/she
made before, the similarity of results on two test administrations tends to
be increased. This inflation of the stability coefficient can logically be
assumed to be directly related to the number of items repeated on the retest.

In re-administering the conventional test, all 40 items were repeated.
However, the number of items repeated in an adaptive retest varies with
the adaptive strategy and with the particular individual's response patterns.
The number of items repeated in a 40-item flexilevel test can range from 1
to 40. In order to assess the magnitude of memory effects on the stabilities
of the flexilevel and conventional tests, a distribution of the number of
items repeated on the flexilevel retest was obtained. The number of items
repeated in a flexilevel test is equal to the number of items in the test
minus the difference between the number-correct scores obtained by an
examinee on test and retest. The relationship between the number of
repeated items and the size of the stability coefficient was examined.
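
The count of repeated items can be illustrated with a short sketch. It
relies on a property that follows from the branching rule: the items an
examinee sees always form a contiguous block of serial numbers around item
40. The function names are hypothetical. In this sketch the block is fixed
by the responses given before the final item, so the count agrees with the
rule stated above up to the response on the final item.

```python
def administered_block(responses, n_taken=40):
    """Serial numbers seen in a (2 * n_taken - 1)-item flexilevel structure.

    responses: sequence of booleans (True = correct), one per item taken.
    """
    # The final response triggers no further branching, so only the first
    # n_taken - 1 responses decide which items are administered.
    ups = sum(1 for r in responses[: n_taken - 1] if r)
    # Each earlier correct answer adds one harder item and each earlier
    # error adds one easier item, giving a contiguous block of n_taken items.
    return set(range(1 + ups, n_taken + 1 + ups))


def items_repeated(test_responses, retest_responses, n_taken=40):
    """Number of items common to the test and retest administrations."""
    first = administered_block(test_responses, n_taken)
    second = administered_block(retest_responses, n_taken)
    return len(first & second)


# 25 correct on the test and 20 correct on the retest (final item missed
# both times): 40 - (25 - 20) = 35 items are repeated.
test = [True] * 25 + [False] * 15
retest = [True] * 20 + [False] * 20
print(items_repeated(test, retest))
```

The extreme cases match the range given in the text: identical performance
repeats all 40 items, while an all-correct test followed by an all-incorrect
retest shares only the first item, number 40.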

Parallel forms reliability. In the two simulated administrations of
each test, examinee ability level and the parameters assigned to each item
were constant, thus yielding the same probability of a correct response
for any given item-individual interaction. However, the random number
determining the scoring of the given item varies so that simulated
re-administration of the same test may yield a different pattern of right
and wrong answers and, in the case of the flexilevel test, differences in
the branching pattern. Thus, simulated re-administration of the same test
can be used to evaluate parallel-forms reliability; while the item parameters
are identical between the two forms, there is no specific item content
overlap since only item parameters determine response patterns in a
simulation study.



The operation of the simulation computer program was such that each
"run" of the program provided ability levels and test scores for 100
"examinees." For each group of 100, Pearson product-moment correlation
coefficients were calculated to express the degree of relationship between
scores obtained from the two simulated administrations of each test. Thus,
there were 100 reliability coefficients for each test obtained from 100
samples from a hypothetical population with a normal distribution of
underlying ability. The mean and standard deviation of the obtained sampling
distributions were used to construct confidence intervals indicating the
effective range of reliability coefficients obtained in replications of the
study using samples of 100. The 95% confidence intervals were obtained by
adding to and subtracting from the mean the value of two standard deviations
of the obtained sampling distribution. In addition, taking the mean of each
sampling distribution as an estimate of the population reliability (ρ),
the standard errors of the mean were calculated and used to test the
significance of the difference between the expected reliability values for
the conventional and flexilevel tests in the population.

In a previous study (Betz & Weiss, 1974) the product-moment coefficients
were transformed to Fisher's z values so that the effects of possible
non-normality in the original distribution of r coefficients on the length
and symmetry of the confidence intervals around the expected value could be
evaluated. However, the expected values and confidence intervals obtained
using the normalized z values were found to be identical to those obtained
from the original distribution, and it was concluded that skewness in the
latter distribution was not a factor influencing the obtained confidence
intervals. Since the parallel-forms correlations in the present study were
expected to be similar in magnitude to those of the previous study, only the
original distribution of r values was used to derive expected values and
confidence intervals.
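
The summary of a sampling distribution of r values described above can be
sketched as follows. The report does not give the exact form of the
significance test, so the z ratio below, which treats the two sets of
coefficients as independent, is an assumption; the function names are
hypothetical, and the coefficients in the example are made-up illustrative
values, not results from the study.

```python
import math
import statistics


def summarize_reliabilities(r_values):
    """Mean, SD, mean +/- 2 SD interval and standard error of the mean."""
    mean = statistics.mean(r_values)
    sd = statistics.stdev(r_values)
    return {
        "mean": mean,
        "sd": sd,
        "lower": mean - 2.0 * sd,     # lower bound of the 95% interval
        "upper": mean + 2.0 * sd,     # upper bound of the 95% interval
        "se_mean": sd / math.sqrt(len(r_values)),
    }


def z_difference(summary_a, summary_b):
    """z ratio for the difference between two mean reliabilities."""
    se_diff = math.sqrt(summary_a["se_mean"] ** 2 + summary_b["se_mean"] ** 2)
    return (summary_a["mean"] - summary_b["mean"]) / se_diff


# Illustrative coefficients only; the study used 100 values per test.
flexilevel = summarize_reliabilities([0.80, 0.82, 0.84])
conventional = summarize_reliabilities([0.83, 0.85, 0.87])
print(flexilevel["lower"], flexilevel["upper"])
print(z_difference(conventional, flexilevel))
```

Because both tests were taken by the same simulated examinees, a test that
allows for correlated coefficients could also be defended; the independent
form is used here only to keep the sketch short.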

Relationships Between Flexilevel and Conventional Test Scores

The examinees in Group 1 were administered both the flexilevel and
conventional tests. To analyze the relationship between the flexilevel
and conventional test scores, product-moment correlations and eta coefficients
for each total score distribution regressed on the other one were computed.
Tests of curvilinearity were made to determine if there were non-linear
relationships between the two score distributions. Similar analyses were
completed for the simulated distributions of 10,000 flexilevel and
conventional test scores.
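
The eta coefficient and a test of curvilinearity can be sketched as below.
The report does not state which test was applied; the sketch assumes the
standard F ratio that compares eta squared with r squared, and it groups
the dependent scores by each distinct value of the other score. All
function names are hypothetical, and the data are made-up values chosen to
show a clearly non-linear pattern.

```python
import math
from collections import defaultdict


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def eta(x, y):
    """Correlation ratio of y on x, with y grouped by distinct x values."""
    groups = defaultdict(list)
    for a, b in zip(x, y):
        groups[a].append(b)
    grand = sum(y) / len(y)
    ss_total = sum((b - grand) ** 2 for b in y)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()
    )
    return math.sqrt(ss_between / ss_total)


def curvilinearity_f(x, y):
    """F ratio for departure from linearity, with its degrees of freedom."""
    n, k = len(y), len(set(x))
    r2, eta2 = pearson_r(x, y) ** 2, eta(x, y) ** 2
    f = ((eta2 - r2) / (k - 2)) / ((1.0 - eta2) / (n - k))
    return f, k - 2, n - k


# Illustrative data: y rises and then falls, so r is 0 but eta is large.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 4, 5, 1, 2]
print(pearson_r(x, y), eta(x, y), curvilinearity_f(x, y))
```

When the relationship is linear, eta and r are nearly equal and the F ratio
is small; a large F indicates that the group means depart from a straight
line.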

Relationships Between Test Scores and Underlying Ability

Product-moment and eta coefficients were calculated to determine the
nature and degree of relationship between each set of 10,000 scores and
the distribution of underlying ability. In addition, the characteristics
of the sampling distribution of 100 r values obtained from the 100 samples
of 100 "subjects" were evaluated; confidence intervals indicating the
effective range of values were constructed and tests were made of the
significance of the difference between the means of the obtained sampling
distributions.



Information Functions

The information function is used to compare two or more strategies of
testing in terms of the amount of information (or relative degree of accuracy
of measurement) provided at different levels on the ability continuum. The
value of information at each level of underlying ability was calculated using
the formula suggested by Birnbaum (1968):

\[
I(\theta) = \left[ \frac{\dfrac{\partial}{\partial\theta}\, E(x \mid \theta)}{\sigma_{x \mid \theta}} \right]^{2} \tag{1}
\]

where I(θ) indicates the amount of information provided by a given test,
scored in a specific way, at a given level of underlying ability θ. The
numerator in Equation 1 is the slope of the regression of observed test
scores on underlying ability (calculated by evaluating the first derivative
of the regression function at that value of θ), and the denominator is the
standard deviation of test scores obtained by testees with ability θ. This
ratio is then squared to obtain I(θ).

According to Lord (1970), the numerator of Equation 1 represents the
capability of test scores to differentiate among examinees with ability
levels in the immediate vicinity of θ. For example, given examinees at
two levels of ability, θ₁ and θ₂, and expected test score values x₁ and x₂,
the magnitude of the slope

\[
\frac{x_2 - x_1}{\theta_2 - \theta_1} \tag{2}
\]

indicates the degree to which the test discriminates these two ability levels.

The denominator of Equation 1 is the conditional standard error of
measurement at a particular level of ability. The square root of I(θ) is
inversely related to the confidence interval for estimating observed score
from underlying ability (Green, 1970). Thus, a low value of I(θ) indicates
a larger standard error of measurement at a particular level of ability,
and the higher the value of I(θ), the smaller the error of measurement.

The procedures used to calculate the relative amount of information
provided by the flexilevel and conventional tests for both the normal and
"equal-frequency" distributions of ability were identical to those used in
the earlier simulation study of two-stage testing (Betz & Weiss, 1974).
The regression equation relating test score (the dependent variable) to
generated ability (the independent variable) was calculated from the normal
distribution data using a least squares curve-fitting program. The third
degree polynomial equation generated was used since higher degree polynomial
equations did not significantly reduce the standard error of estimate of
the dependent variable (i.e., test score). The first derivative of the
third degree polynomial was then derived so that the slope of the regression
function could be calculated at the desired θ levels.

The normal ability distribution was divided into 33 intervals between
θ=-3.3 and θ=+3.3. Each interval had a width of .2, and the midpoint of the
interval was used to calculate the slope of the function at that level of
ability. Thus, the lowest ability interval was θ=-3.3 to θ=-3.1, and
θ=-3.2 was taken as its midpoint. For each interval, the variance of the
test scores of individuals whose generated ability level fell into that
interval was calculated.
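
The calculation of an information value at one interval midpoint can be
sketched as follows. The sketch takes the cubic regression coefficients as
given rather than fitting them, uses the population form of the standard
deviation, and treats each interval as closed below and open above; all
three choices are assumptions, as are the function names. The coefficients
and scores in the example are made-up illustrative values.

```python
import statistics


def cubic_slope(coeffs, theta):
    """First derivative of c0 + c1*t + c2*t**2 + c3*t**3 at t = theta."""
    c0, c1, c2, c3 = coeffs
    return c1 + 2.0 * c2 * theta + 3.0 * c3 * theta ** 2


def interval_midpoints(low=-3.3, width=0.2, n=33):
    """Midpoints of n adjacent ability intervals starting at low."""
    return [round(low + width * (i + 0.5), 1) for i in range(n)]


def information(coeffs, abilities, scores, midpoint, width=0.2):
    """Squared ratio of regression slope to conditional score SD."""
    lower, upper = midpoint - width / 2.0, midpoint + width / 2.0
    # Scores of simulated examinees whose ability falls in this interval.
    in_interval = [s for t, s in zip(abilities, scores) if lower <= t < upper]
    sd = statistics.pstdev(in_interval)
    return (cubic_slope(coeffs, midpoint) / sd) ** 2


# Illustrative values only: a cubic with slope 5 at theta = 0 and three
# examinees in the interval centered on 0.
coeffs = (20.0, 5.0, 0.0, -0.1)
abilities = [-0.05, 0.0, 0.05, 1.0]
scores = [18, 20, 22, 30]
print(interval_midpoints()[:3])
print(information(coeffs, abilities, scores, midpoint=0.0))
```

Intervals containing few simulated examinees give unstable standard
deviations, and the sketch would fail on an empty interval or one with
identical scores; a fuller program would need to guard against both cases.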



When the normal distribution of ability was used, however, the number
of individuals within each interval differed at all points along the ability
continuum. That is, since interval length was constant, large numbers of
individuals fell into the intervals in the middle of the continuum, while
the ability intervals at or near the extremes had considerably fewer
individuals. Thus, information values for extreme ability levels were less
stable than those nearer the middle because the score variance was more
influenced by chance similarities or differences among scores determined for
individuals of approximately the same ability.

In order to obtain information values with more equivalent stability
across the ability continuum, the "equal-frequency" distribution with 100
"examinees" at each of the 16 ability levels shown in Table 10 was used.
While the numerator of Equation 1 used slope values based on the first
derivative of the regression equation derived from the normally distributed
population (thus yielding slope values based on different sample sizes),
the $\sigma_{x \mid \theta}$ values in the denominator of Equation 1 were
all calculated using samples of 100. Thus, in the "equal-frequency"
distribution of ability, the numerator of the information equation was the
slope at one of the 16 ability levels, and the denominator was the standard
deviation of the 100 scores generated at that level.

RESULTS 

Order Effects 

Table 2 presents the results of the analysis of the effects of order
of administration on the means of the obtained test scores. Results are
indicated for both methods of scoring the flexilevel test (Lord, 1971b).
As the table indicates, there were no significant differences in mean scores
as a function of order of administration for either of the groups. These
results correspond to previous findings that order of administration does
not affect scores on conventional tests (Betz & Weiss, 1973; Larkin & Weiss,
1974), two-stage tests (Betz & Weiss, 1973), pyramidal tests (Larkin &
Weiss, 1974), or two adaptive tests taken in combination (Larkin & Weiss,
1975).



                                 Table 2

                Flexilevel Test Score Means and Standard
                Deviations for Subgroups Completing the
                  Flexilevel Test in Different Orders

                  Order 1:              Order 2:             Test of
Group         Flexilevel First      Flexilevel Second      Significance
and Score     N    Mean    S.D.     N    Mean    S.D.      t     df    p

Group 1
  Score 1    54   19.37    6.09    53   19.34    4.99     .03   105   .97
  Score 2    54   39.17   12.00    53   39.17   10.00     .00   105   .99

Group 2a
  Score 1    57   22.35    4.75    50   21.92    6.22     .41   105   .68
  Score 2    57   45.19    9.43    50   44.20   12.36     .47   105   .63

Ability and Test Score Distributions

Empirical study. Table 3 contains data describing the flexilevel and
conventional test score distributions; results are presented separately
for each subject group since the groups were expected to differ in mean
ability level.

Table 3 shows that the mean Score 1 on the first testing with the
flexilevel test was lowest for Group 1 (19.36), next higher for Group 2a
(22.15), and highest for Group 3 (27.08). The differences between each of
these group means were statistically significant (p<.01). These results
indicate significant differences in ability level in the three groups and
were expected since Group 1 consisted of beginning undergraduate students,
Group 2a consisted of somewhat more advanced undergraduates, most of them
psychology majors, and Group 3 consisted of honors undergraduate and graduate
students. The standard deviations of scores in the three groups indicated
essentially equivalent within-group variability among the groups.

Differences among the three groups were also reflected by the skewness
of the score distributions. The Group 1 scores were significantly positively
skewed, indicating a concentration of lower scores, while those of Group 3
were significantly negatively skewed, indicating a predominance of higher
scores. The scores of Group 2a were not skewed. The Group 2a score
distribution was somewhat platykurtic, indicating a more even spread of
scores than was the case in Group 1, where scores were normally peaked, or
Group 3, in which the scores tended to be more peaked than is typical of a
normal distribution.

Group differences were also reflected by mean scores on the conventional
test. The mean score obtained by Group 1 (18.58) was significantly (p<.01)
less than the mean score obtained by Group 2b (24.19). The variability of
the scores in the two groups was almost identical (8.22 and 8.28). Further,
the Group 1 conventional test scores were again significantly positively
skewed, while those of Group 2b were not skewed. The Group 2b scores were
also significantly platykurtic on Time 1, indicating that the distribution
of scores was flatter than a normal distribution of scores.

The shape of the score distributions indicates that the difficulty
levels of both the flexilevel and conventional tests were most appropriate
for the individuals sampled from the Group 2 population. However, an
analysis of the mean number-correct scores in relation to expected means
offers further information concerning the appropriateness of the tests for
measuring groups of individuals differing in ability level.

The mean difficulty of the conventional test items was b=-.[illegible],
corresponding to a p (proportion correct) value of .57 in the norming
sample. This p value should result in a mean number correct of 23 of the
40 items administered in samples of examinees similar in ability to the
norming group. In Group 2b, the mean number correct on the conventional
test was 24.19, close to that expected, while the mean number correct in
Group 1, 18.58, indicates that the conventional test items were somewhat
too difficult for the group as a whole.

For the flexilevel test, the mean item difficulty was b=.01, yielding
a p value of .50 or an expectation of 20 correct of the 40 items administered.
However, the flexilevel test is also designed so that the items administered
to a given examinee are more appropriate to his/her ability level, or in
other words, closer to p=.50 difficulty for the examinee. Thus, while mean
score differences on a flexilevel test should to some extent reflect group
differences in ability, they should also tend to be closer to 50% correct
in different subject groups than should mean scores on a conventional test.

Comparing the flexilevel and conventional test score means within
groups indicates that for Group 1, the mean number correct (flexilevel
mean Score 1 equal to 19.36 and conventional score mean equal to 18.58)
was not significantly different for the two tests; however, the flexilevel
mean was closer to the expectation of 50% or 20 items correct than was the
conventional test mean to the expectation of 57% or 23 items correct. In
Group 2, the mean conventional test score (24.19) was significantly (p<.01)
greater than the mean flexilevel Score 1 (22.15). The conventional test
mean was close to its expectation, but the flexilevel mean was again closer
to 50% correct. If item difficulty were the only factor influencing the
mean scores, a higher conventional test mean would be expected in both
groups since these items were somewhat easier, on the average, than the
items in the flexilevel test. Thus, it appears that the flexilevel test
does adapt item difficulties to the ability levels of individuals within
groups and across groups differing in ability. The fact that the flexilevel
test score means were closer to .50 also implies less guessing on the
flexilevel test.
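
The arithmetic behind this comparison can be sketched as follows. This is
only an illustration of the calculation described above: the means and p
values are those reported in the text, while the function and variable
names are hypothetical.

```python
# Expected number correct = items administered x mean p value (from the text).
N_ITEMS = 40

expected = {
    "flexilevel": N_ITEMS * 0.50,    # 20 items
    "conventional": N_ITEMS * 0.57,  # 22.8, reported in the text as 23
}

# Observed mean number correct reported in the text.
observed = {
    ("Group 1", "flexilevel"): 19.36,
    ("Group 1", "conventional"): 18.58,
    ("Group 2", "flexilevel"): 22.15,
    ("Group 2", "conventional"): 24.19,
}


def proportion_correct(mean_score, n_items=N_ITEMS):
    """Mean number correct expressed as a proportion of items administered."""
    return mean_score / n_items


def gap_from_half(mean_score, n_items=N_ITEMS):
    """Absolute distance of the mean proportion correct from .50."""
    return abs(proportion_correct(mean_score, n_items) - 0.50)


for (group, test), mean in observed.items():
    print(f"{group:8s} {test:13s} mean={mean:5.2f} "
          f"p={proportion_correct(mean):.3f} "
          f"|p-.50|={gap_from_half(mean):.3f} "
          f"expected={expected[test]:.1f}")
```

In both groups the flexilevel mean falls nearer to 50% correct than the
conventional mean, which is the pattern the paragraph above describes.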

Further comparison of flexilevel and conventional test score distributions
indicates that for Group 1, conventional test scores were significantly
more variable (p<.01) than were the flexilevel scores. Both distributions
were significantly positively skewed, reflecting the lower ability level of
Group 1 as a whole. The conventional score distribution showed a
nonsignificant tendency toward flatness (platykurtosis) not shown by the
flexilevel scores.

In Group 2, the conventional test scores were again significantly more
variable. Neither distribution of scores was skewed, although both tended
to be flatter than a normal distribution; the latter tendency was statistically
significant only for the conventional test.

Simulation study. The assumption that 10,000 ability levels sampled
from a normally distributed population would themselves be normally
distributed was accepted. The mean ability level was 0.0, the variance was 1.0,
and the degrees of skewness and kurtosis of the ability distribution did
not show significant departures from normality.
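
A check of this kind can be sketched as below. It is a minimal
illustration, not the procedure used in the study: the random number
generator, the seed, and the large-sample standard errors used to judge
the size of the departures are all assumptions introduced here.

```python
import math
import random


def describe(sample):
    """Return mean, variance, skewness, and excess kurtosis of a sample."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    sd = math.sqrt(var)
    skew = sum(((x - mean) / sd) ** 3 for x in sample) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in sample) / n - 3.0
    return mean, var, skew, kurt


rng = random.Random(1975)  # arbitrary seed (assumption)
abilities = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
mean, var, skew, kurt = describe(abilities)

# Approximate standard errors of skewness and kurtosis under normality.
n = len(abilities)
se_skew = math.sqrt(6.0 / n)
se_kurt = math.sqrt(24.0 / n)

print(f"mean={mean:.3f} variance={var:.3f}")
print(f"skewness={skew:.3f} (about {skew / se_skew:.1f} standard errors from 0)")
print(f"kurtosis={kurt:.3f} (about {kurt / se_kurt:.1f} standard errors from 0)")
```

With 10,000 sampled values the skewness and kurtosis indices are expected
to lie within a few hundredths of zero, so even small departures from
normality would be detectable.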

Table 4 presents data describing the distributions of 10,000 scores
generated in the simulation study. The data for the flexilevel test may
be compared to that of Score 2 in the empirical study, as shown in Table 3,

Table 4

Descriptive Data for Flexilevel and Conventional Test
Score Distributions Generated by Monte Carlo
Simulation With an Underlying Normal
Distribution of Ability

Test                        N      Mean    S.D.    Skew    Kurtosis

Flexilevel (Score 2)
  Time 1                 10,000    46.6   10.06    -.23*    -.29*
  Time 2                 10,000    46.6   10.08    -.24*    -.31*

Conventional
  Time 1                 10,000    25.9    6.48    -.25*    -.46*
  Time 2                 10,000    25.9    6.43    -.23*    -.53*

*Statistically significant at p<.01

The flexilevel Score 2 mean was 46.6, corresponding closely to that
obtained by Group 2a in the empirical study (44.73). This Score 2 mean
indicates that the mean number correct (Score 1 in the empirical study)
for the simulated examinees was about 23. The conventional test score mean
(25.9) was most similar to that obtained by Group 2b in the empirical study
(24.19). The agreement of the simulated data with that of Group 2 in the
empirical study was expected; other samples from the Group 2 population
(introductory statistics students) comprised a large proportion of the
original item norming samples, and the average ability level of this group
was at about the mean of the norming population as a whole.

The mean number correct on the conventional test (25.9) was significantly
greater (p<.01) than the mean number correct on the flexilevel test
(assuming it to be 23); this result is again in agreement with that found
for Group 2 in the empirical study. The standard deviation of the flexilevel
number correct scores was about 5.0 (since the variability of Score 1 was
shown in the empirical study to be roughly half that of Score 2) as compared
to a standard deviation of about 6.5 for the conventional test scores.
While the conventional scores were again more variable than the flexilevel
scores, score variability for both tests in the simulation study was
uniformly lower than that shown in the empirical study.

Both the flexilevel and conventional test score distributions were
significantly negatively skewed and significantly platykurtic. However,
the flexilevel scores were less platykurtic than were the conventional
scores, indicating that the former more closely reflect the known underlying
normal distribution of ability. The direction of skewness for the
flexilevel scores paralleled the negative skew found for Group 3 in the empirical
study, although the absolute degree of skewness was less in the simulation
study. The skewness of the conventional test scores was closest in degree
to that found in the Time 2 administration in Group 2b in the empirical
study. The platykurtosis characterizing both simulated score distributions
is in agreement with that shown by Group 2 in the empirical study, although
not by Groups 1 and 3.

Reliability



Test-retest stability. Table 5 contains the test-retest stability
correlations for the flexilevel and conventional tests, as obtained from
the empirical study. The first set of columns indicates the stability
of each test for the total group of examinees; the last three sets of
columns show stability as a function of the length of the interval between
test and retest.

Table 5

Test-retest Stability Correlations as a Function
of Interval Length, and for Total Group
(empirical data)

                                    Retest Interval (in days)
                 Total Group       13-30       31-46       47-67
Test               N     r        N     r     N     r     N     r

Flexilevel
  Score 1         194   .89      53   .92    91   .86    50   .88
  Score 2         194   .89      53   .93    91   .86    50   .87

Conventional       74   .89      25   .89    28   .91    21   .87


The overall stability of scores on the two tests was comparable; both
had test-retest correlations of .89. Stability was not clearly related
to interval length for either testing strategy. Scores on the flexilevel
test were most stable over the shortest interval (r=.92 or .93), but
stability over the two longer intervals was about the same. In contrast,
scores on the conventional tests were most stable over the moderate interval
(r=.91), and least stable over the longest interval (r=.87). The flexilevel
test scores were more stable over a short time interval than were scores
on the conventional test; conventional test scores were more stable in the
moderate time interval; and scores on the two testing strategies showed
equal stability in the long time interval.

Table 6 indicates the number of examinees repeating 40, 39, 38, or 37
or fewer items on the retest, the mean and standard deviation of the number
correct (Score 1) obtained by each group of examinees, and the stability
of scores within each group.

Table 6

Stability of Flexilevel Test Scores (Score 1) as a
Function of the Number of Items Repeated

Number of                                   Standard
Items                    Mean               Deviation
Repeated       N     Time 1   Time 2    Time 1   Time 2      r

40             39     27.13    27.13      5.13     5.13    1.00
39             63     25.03    25.33      6.55     6.33     .99
38             40     26.42    26.82      5.82     5.98     .94
37 or less     52     22.87    25.24      6.60     6.28     .64

As shown in Table 6, almost three-fourths of the total group repeated 38
or more items; only 52 of the 194 examinees repeated 37 or fewer.

Examinees who repeated 40 items obtained the same test score on both
administrations of the test; thus, there must of necessity be no change in
the mean score from Time 1 to Time 2 and a correlation of 1.0 between the
two sets of scores. For examinees who repeated 38 or 39 items, mean scores
showed an insignificant increase from Time 1 to Time 2. The stability
of scores within these groups, r=.99 and r=.94, while probably partly an
artifact of the only 1 or 2 point score changes shown from Time 1 to Time 2,
also indicates high consistency in the direction of score changes in terms
of maintaining at Time 2 the rank order established on the Time 1
administration.

The performance of examinees repeating fewer than 38 items was markedly
different from that of the other examinees in several respects. The mean
score on Time 1 for this group was lower than that for the other three
and was significantly (p<.01) poorer than was the performance of examinees
repeating 39 or 40 items. However, this group of examinees showed a
significant (p<.01) increase in the mean score obtained on the Time 2
administration; this increase (from 22.87 to 25.24) brought the performance
of the group repeating fewer than 38 items to a level comparable to that
of the groups repeating 38 or more. Finally, test-retest stability dropped
markedly in this group, from r=.94 in the "38" group to r=.64 in the "37 or
less" group.

From these results it would appear that the overall stability of the
flexilevel test (r=.89) reflects the combined effects of 1) a majority of
examinees whose performance from Time 1 to Time 2 was highly stable in
terms of both rank order and overall level of performance, and 2) a small
group of examinees whose overall performance was initially at a significantly
lower level than that of the larger group, whose mean score increased
significantly on the Time 2 testing, but who showed far less consistency
in rank ordering from Time 1 to Time 2.

Parallel forms reliability. Table 7 presents the characteristics of
the sampling distributions of parallel forms reliability coefficients
obtained from the simulation study. These data show that the flexilevel
test was more reliable than the conventional test, having a mean reliability
of .84 as contrasted with that of .80 for the conventional test. This
difference was statistically significant at p<.001.

Table 7

Characteristics of Sampling Distributions of
Parallel Forms Reliability Coefficients
Using 100 Random Samples of 100 "Testees"

                                                    95% Confidence Interval
                                    Range                (±2 S.D.'s)
Test             Mean    S.D.   Maximum  Minimum       Upper     Lower

Flexilevel        .84    .029     .90      .74          .90       .78
Conventional      .80    .038     .88      .65          .87       .72

The standard deviation and range of coefficients obtained for the
flexilevel test were also smaller than the values for the conventional
test, indicating more consistency in the reliability estimates obtained
from the 100 samples. The obtained 95% confidence intervals indicate that
the effective range of reliability coefficients based on sample sizes of
100 for the flexilevel test was between .78 and .90, while that for the
conventional test was between .72 and .87.
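
The summary statistics in Table 7 can be illustrated with a simplified
sketch. It assumes that scores on two parallel forms are bivariate normal
with a population correlation equal to the reported mean reliability; the
study itself generated item responses, so this stands in for that process
only loosely. The function names and the seed are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sampling_distribution(rho, n_samples=100, n_per_sample=100, seed=0):
    """Correlations from repeated samples of bivariate normal 'testees'."""
    rng = random.Random(seed)
    coefficients = []
    for _ in range(n_samples):
        form_a, form_b = [], []
        for _ in range(n_per_sample):
            a = rng.gauss(0.0, 1.0)
            # b correlates rho with a in the population
            b = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
            form_a.append(a)
            form_b.append(b)
        coefficients.append(pearson_r(form_a, form_b))
    return coefficients


def summarize(coefficients):
    """Mean, S.D., range, and the mean +/- 2 S.D. interval used in Table 7."""
    n = len(coefficients)
    mean = sum(coefficients) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in coefficients) / (n - 1))
    return {"mean": mean, "sd": sd,
            "max": max(coefficients), "min": min(coefficients),
            "upper": mean + 2 * sd, "lower": mean - 2 * sd}


for label, rho in (("Flexilevel", 0.84), ("Conventional", 0.80)):
    s = summarize(sampling_distribution(rho))
    print(f"{label:13s} mean={s['mean']:.2f} sd={s['sd']:.3f} "
          f"range={s['min']:.2f}-{s['max']:.2f} "
          f"interval={s['lower']:.2f}-{s['upper']:.2f}")
```

Under these assumptions, the spread of coefficients across samples of 100
is of roughly the same order as the standard deviations reported in
Table 7, which is why single small samples can give unrepresentative
reliability estimates.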


Relationships between Flexilevel and Conventional Test Scores 

Table 8 presents the product-moment correlations and eta coefficients
describing the relationship between flexilevel and conventional test scores
for both the empirical and simulation data. All of the obtained coefficients
were significantly different from zero (p<.001), and none of the eta
coefficients indicated a significant degree of non-linearity in the
relationship between the two distributions of scores.
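
The comparison of r with eta can be sketched as follows. This is a
minimal illustration of the correlation ratio, assuming that eta for the
regression of y on x is computed from the means of y within each distinct
value of x; the toy data and function names are hypothetical, and the
significance test for non-linearity is not shown.

```python
import math
from collections import defaultdict


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_ratio(x, y):
    """Eta for the regression of y on x: sqrt(SS_between / SS_total)."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    grand_mean = sum(y) / len(y)
    ss_total = sum((yi - grand_mean) ** 2 for yi in y)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    return math.sqrt(ss_between / ss_total)


# Toy data (hypothetical): a linear and a curved relationship.
x = [1, 1, 2, 2, 3, 3]
y_linear = [1, 2, 2, 3, 3, 4]
y_curved = [1, 2, 4, 5, 1, 2]

for label, y in (("linear", y_linear), ("curved", y_curved)):
    print(f"{label}: r={pearson_r(x, y):.3f} eta={correlation_ratio(x, y):.3f}")
```

When the group means of y fall on a straight line, eta equals the
absolute value of r; eta exceeds r only to the extent that the regression
departs from linearity.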



Table 8

Relationships between Flexilevel
Scores (Score 2) and Conventional
Test Scores

                                                 Time 1

Empirical Study (Group 1, N=103) (a)
  Product-moment correlation                      .89
  Regression of flexilevel scores
    on conventional scores (eta)                  .90
  Regression of conventional scores
    on flexilevel scores (eta)                    .91

Simulation Study (Time 1, N=10,000) (b)
  Product-moment correlation                      .82
  Regression of flexilevel scores
    on conventional scores (eta)                  .82
  Regression of conventional scores
    on flexilevel scores (eta)                    .82

(a) ... incomplete response records on either the flexilevel or
    conventional test
(b) Data for Time 2 are not shown since the results were identical
    to the Time 1 data



The relationship between scores was higher in the empirical study
than in the simulation study; in the former, r=.89 with eta coefficients
of .90 and .91, and in the latter both r and eta coefficients were equal
to .82. Thus, flexilevel test scores accounted for about 81% of the
variance in conventional test scores in the empirical study, but for only about 67%
in the simulation study.

Relationships between Test Scores and Ability

The product-moment r and eta coefficients summarizing the extent of
relationship between flexilevel test scores and generated underlying ability
("validity") in the simulation data were equal to .91 for both flexilevel
"administrations," as calculated using all 10,000 scores. The coefficients
for the conventional test and ability were both equal to .89. Both sets
of coefficients indicated a high linear relationship between test scores
and ability, although the flexilevel test showed a significantly (p<.001)
higher relationship. Thus, underlying ability level accounted for
approximately 83% of the variance in flexilevel test scores and for
approximately 79% of the variance in conventional test scores.
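
The percentages of variance quoted in this and the preceding section
follow from squaring the coefficients. The short sketch below reproduces
that arithmetic from the reported values; the labels and function name
are hypothetical. Note that the 81% figure given earlier for the empirical
study corresponds to the eta coefficient of .90, whereas r=.89 gives
about 79%.

```python
def variance_accounted_for(coefficient):
    """Percentage of variance shared: 100 times the squared coefficient."""
    return 100.0 * coefficient ** 2


# Coefficients as reported in the text and in Table 8.
reported = {
    "flexilevel vs conventional, empirical r": 0.89,
    "flexilevel vs conventional, empirical eta": 0.90,
    "flexilevel vs conventional, simulation r": 0.82,
    "flexilevel vs ability, simulation r": 0.91,
    "conventional vs ability, simulation r": 0.89,
}

for label, coefficient in reported.items():
    print(f"{label:43s} {coefficient:.2f} -> "
          f"{variance_accounted_for(coefficient):.0f}%")
```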



Table 9

Characteristics of Sampling Distributions of
Product-moment Correlations between Test
Scores and Simulated Ability Calculated on
100 Samples of 100 Subjects

                                                       95% Confidence Interval
                                        Range               (±2 S.D.'s)
Variables               Mean   S.D.   Maximum  Minimum     Upper     Lower

Flexilevel-Ability
  Time 1                 .91   .015     .95      .87        .94       .88
  Time 2                 .91   .015     .95      .87        .94       .88

Conventional-Ability
  Time 1                 .89   .020     .93      .81        .93       .85
  Time 2                 .89   .019     .93      .85        .93       .85


Table 9 presents the characteristics of the sampling distribution of
product-moment coefficients calculated on 100 groups of 100 testees. A
comparison of the mean values shown in Table 9 with those calculated for
the total distribution of 10,000 sets of scores (.91 for flexilevel, .89
for conventional) shows that the two methods gave identical results: the
mean r for flexilevel was .91, and the mean r for the conventional was .89.

Examination of the confidence intervals shows that, for flexilevel,
the effective range of correlations with ability over 100 samples was
between .88 and .94, while that for the conventional test was between
.85 and .93. The difference between the means of the obtained sampling
distributions was statistically significant at p<.001.

Information Functions 

Equal-frequency distribution. Table 10 presents estimated values of
the information function (I(θ)) for the flexilevel and conventional tests
at each of sixteen ability levels. The values at each level were obtained
through application of the method of "moving averages" (McNemar, 1969, p. 8)
to the average of the values obtained from the two administrations of
each test. Thus, the values in Table 10 represent "best" average estimates
of the value of the information at each ability level. Table A-3 in the
Appendix indicates the information values averaged over the two
administrations (but before application of the method of "moving averages"), separate
values for the first and second administrations of each test, and the mean
and standard deviation of information values over the 16 ability levels used.

Table 10

Values of the information function (I(θ)) for flexilevel and conventional
tests at points along the continuum of underlying ability (equal-frequency
distribution)

Level of
Ability (θ)      Flexilevel     Conventional

   3.2               .18             .32
   3.0               .66             .82
   2.5              1.71            1.71
   2.0              3.20            2.84
   1.5              4.72            3.93
   1.0              5.76            4.53
    .5              6.38            4.76
    .1              6.62            4.65
   -.1              6.70            4.41
   -.5              6.38            4.04
  -1.0              5.80            3.59
  -1.5              4.81            2.99
  -2.0              3.88            2.25
  -2.5              2.90            1.44
  -3.0              2.10             .74
  -3.2              1.13             .27

  Mean              3.86            2.68
  S.D.              2.23            1.62

Note. Values obtained using method of "moving averages" (McNemar, 1969, p. 8)
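
The two-step smoothing described for Table 10 (averaging the two
administrations, then applying moving averages) can be sketched as below.
The window width of three points and the handling of the two end points
are assumptions made for illustration; the text cites McNemar (1969) for
the method but does not give these details. The input values are
hypothetical.

```python
def average_administrations(time1, time2):
    """Average the Time 1 and Time 2 values at each ability level."""
    return [(a + b) / 2 for a, b in zip(time1, time2)]


def moving_average(values, window=3):
    """Centered moving average; end points use the neighbors available.

    The window width and end-point rule are assumptions, not taken
    from the source.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        segment = values[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Hypothetical information values at five adjacent ability levels.
time1 = [1.0, 3.5, 4.0, 6.5, 6.0]
time2 = [2.0, 2.5, 5.0, 5.5, 7.0]

averaged = average_administrations(time1, time2)
print("averaged:", averaged)
print("smoothed:", [round(v, 2) for v in moving_average(averaged)])
```

Smoothing of this kind damps fluctuations between adjacent ability
levels that arise from sampling error in any one administration.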

The data contained in Table 10 are summarized in graphic form in Figure 4.

The shape of the information curve for the conventional test, as shown
in Figure 4, is very similar to that found in Lord's (1971d) theoretical
study; that is, the information values are highest near the center of the
ability distribution and drop off sharply at the extremes. Lord's results,
using "ideal" items, and the results indicated here, using a set of items
with parameters that are typical of those occurring in empirical test
construction and which did not permit the construction of a perfectly peaked
conventional test, both show that a conventional test offers greatest
precision of measurement for individuals near the median ability level of the
group and decreasing precision with divergence of an individual's ability
from the median level.

Figure 4 




Figure 4 also shows that the flexilevel test, while providing more
information than the conventional test for ability levels between θ=-3.2
and θ=2.0, did not provide more constant accuracy of measurement across
all ability levels. Contrary to Lord's results, in which the flexilevel
test showed a more nearly horizontal information function, the shapes of
the two information functions shown in Figure 4 are actually quite similar;
both tests showed greatest accuracy near the ability level corresponding to
the mean difficulty of the test items, and a substantial drop in accuracy
at more extreme ability levels.

The overall level and shape of the information functions shown in
Figure 4 are also reflected by the means and variances of the information
values for each test, as shown in Table 10. The mean value for flexilevel
(3.86) was higher than that for the conventional test (2.68), but the standard
deviation of information values for the flexilevel test (2.23) was greater
than that for the conventional test (1.62). The larger standard deviation
for the flexilevel test reflected a greater degree of variation in information
values across the sixteen levels of ability.

The data in Table 10, representing "best" estimates of the value of
information at each ability level, may be compared with that in Appendix
Table A-3, in which the Time 1 and Time 2 results are presented separately.
The data in Table A-3 indicate that while the means and standard deviations
of information values were similar for the two test administrations, there
were substantial differences in the information values at a given ability
level. For example, at θ=2.0, the Time 1 administration resulted in a
flexilevel information value of 3.47, while the Time 2 value for flexilevel
was 2.33. For the Time 1 administration, the flexilevel test provided
most information (7.18) at θ=-1.0, while the greatest amount of information
in the Time 2 administration was provided at θ=-.1. Similar differences
due to sampling error were found for the conventional test.

Normal distribution. Appendix Table A-4 presents the estimated values
of I(θ) provided by the flexilevel and conventional tests when calculated
using subjects with an underlying normal distribution of ability; again,
these values were obtained by application of the method of "moving averages"
to the averages of the Time 1 and Time 2 administrations. Table A-5 in
the Appendix contains the initial average information values, the separate
values for the first and second test administrations, the mean and standard
deviation of the 33 values for each test, and the number of "testees" assigned
ability levels within each interval of ability.

The results indicated in Table A-4 are summarized graphically in
Figure 5.



Figure 5 

INFORMATION FUNCTIONS OF FLEXILEVEL AND CONVENTIONAL TESTS USING 
NORMAL DISTRIBUTION OF ABILITY 




As shown in Figure 5, both the flexilevel and conventional tests again
show greatest accuracy of measurement at the ability level corresponding
to the mean difficulty of the items, and losses of accuracy at the extremes.
Again, flexilevel provides more information between θ=-3.2 and about θ=1.5,
but at ability levels greater than θ=1.5, the two tests yield essentially
equal information values.

The means and standard deviations of the information values, as shown
in Table A-4, indicate that the flexilevel test provided a higher overall
level of information (3.81) than did the conventional test (2.85) but that
its information values were also slightly more variable (1.70 versus 1.53).
Again, the results obtained from the separate administrations of each test
(as shown in Appendix Table A-5) indicate substantial variability in the
information values corresponding to each ability level.

DISCUSSION

Comparison of the score distributions obtained from three groups of
subjects in the live testing indicated that both the flexilevel and
conventional tests reflected differences in the mean ability levels of the
three subject groups in terms of both the mean number-correct obtained by
each group and the skewness of the group score distributions. In terms
of these two characteristics, it appeared that the average difficulty level
of the test items was most appropriate for Group 2. In this group, both
score distributions tended to be platykurtic, although the degree of
platykurtosis was greater for the conventional test and was statistically
significant on the Time 1 administration. These findings are in agreement
with previous findings (Betz & Weiss, 1973; Larkin & Weiss, 1974) showing
that conventional tests yielded score distributions that were more
platykurtic than the adaptive (two-stage and pyramidal) tests with which
they were compared. In the present study, conventional test scores were
more variable than the flexilevel scores.

While the flexilevel test did reflect differences in the ability levels
of the groups, it was also found to adapt item difficulties to differences
in the ability levels of examinees within groups; this was inferred from
the fact that in Groups 1 and 2, the mean number-correct for the flexilevel
test was closer to 50% correct than it was for the conventional test even
though the mean difficulty level of the items in the two tests would have
implied otherwise. This finding is similar to that found by Larkin & Weiss
(1974) for pyramidal adaptive tests, in which the mean number of items answered
correctly was slightly more than half of the 15 items administered. These
results suggest that adaptive tests reduce random guessing, since the mean
number correct was close to that expected from free-response items, although
the test used multiple-choice items.

The score distributions yielded in the simulation study were most
similar to those yielded by Group 2 in the empirical study in terms of the
mean number-correct and the tendency toward platykurtosis. The simulated
score distributions for both tests were significantly negatively skewed and
significantly platykurtic, but the flexilevel test better reflected the
underlying normal distribution of ability. Again, conventional scores
were more variable than flexilevel scores, but both sets of scores were
uniformly less variable than those in the empirical study. In the
simulation study of Betz & Weiss (1974), two-stage tests were also found to better
reflect the underlying normal distribution of ability than did the
conventional test, but, again, all score distributions were significantly
platykurtic, and the score distributions of the conventional test and one
of the two-stage tests were significantly negatively skewed.

Comparing the empirical and simulation studies indicated that, as in
Betz & Weiss (1974), real testees obtain scores that are uniformly more
variable than are scores generated in the simulation studies. In contrast
to simulated examinees, actual testees differ from each other on variables
in addition to ability level. Differences in motivation to do well, anxiety
level, and tendency to guess may contribute to additional variance in test
scores obtained from live test administration.

The test-retest stability of scores from both tests was identical; both
had stability coefficients of r=.89. No consistent relationship between
stability and the length of the interval between test and retest was found
for either test, although the flexilevel test was more reliable in the short
time interval. The stability of flexilevel test scores (r=.89) was identical
to that found for scores from a 40-item two-stage test (Betz & Weiss,
1973). The stability of scores on a 15-item pyramidal test was found to
range between r=.79 and r=.89 for different methods of scoring the test;
the modal correlation was r=.86 (Larkin & Weiss, 1974). These data suggest
that the pyramidal testing strategy, which with 15 items achieved stabilities
as high as the 40-item flexilevel test, is a more efficient method of
adaptive testing.

The analysis of the possible effects of memory of items repeated on
the size of stability coefficients calculated in the live-subject group
showed that on the flexilevel test, three-fourths of the total group repeated
38 or more of the 40 items administered. Since the number of items repeated
in the flexilevel test could vary between 1 (the first item administered
to all examinees) and 40, the fact that most people repeated 38 to 40 items
indicates substantial consistency in the responses of examinees over the two
test administrations. This in turn would appear to imply that the
flexilevel tailors item difficulties to be appropriate to each examinee's ability;
for example, low ability examinees receiving many items that are too
difficult for them would be likely to perform inconsistently over two test
administrations because of the possible effects of random guessing.

Further, the stability of scores for examinees who repeated 38 to 40
items on the flexilevel test was higher (r=.94 to r=1.00) than the stability
of conventional test scores, on which examinees repeated all 40 items (r=.89).
Thus, when the two tests were roughly equated for the effects of memory, the
flexilevel test yielded more stable scores. This finding is in agreement
with the findings of Betz & Weiss (1973) in which scores from a two-stage
test were more stable than those from a conventional test when the effects
of memory were equated, and the findings of Larkin & Weiss (1974) which
showed that memory was operating to inflate the stability of conventional
test scores.






The flexilevel test had significantly higher parallel forms reliability
(r=.84) than did the conventional test (r=.80), as determined from the
simulation study. The reliability of the flexilevel test compares favorably
to parallel forms reliability coefficients of r=.76 and r=.83 for two
two-stage tests as found in the simulation study of Betz & Weiss (1974); in that
study, the reliability of the conventional test was also r=.80.

In both the present simulation study and that of Betz & Weiss (1974),
however, there was substantial variability among the reliability coefficients
calculated across 100 samples of size 100. In the present study, the
effective range of coefficients (i.e., 95% of those obtained) was between
.78 and .90 for the flexilevel test and between .72 and .87 for the
conventional test. This finding has implications for the interpretation of
results of simulation studies based on single samples of 100 or fewer
"subjects" (e.g., Jensema, 1972; Urry, 1970); in such cases, obtained
reliability or validity coefficients may not be representative of results
that would be obtained over a larger number of samples or using a single
large sample.

Parallel forms reliability as determined from the simulation study
was expected to be lower than the test-retest stability because it includes
as systematic score variance fewer kinds of specific or error variance
(Stanley, 1971). A test-retest stability coefficient includes as systematic
variance two sources of variance which are treated as error in a
parallel-forms design: 1) variance specific to the content of particular items, and
2) actual memory of particular items and of the previous responses to them.
Thus, when the factors of item content sampling and memory are significant
sources of variance, test-retest stability coefficients will be higher than
parallel-forms coefficients. It may be noted that there was a larger
difference between stability and parallel-forms reliability for the
conventional test (r=.89 versus r=.80) than there was for the flexilevel
test (r=.89 versus r=.84). Since there is no reason to suspect differences
between the two tests in content-specific variance, the greater difference
for the conventional test supports a hypothesis that the stability for the
conventional test is inflated more by memory factors.

The correlation between flexilevel and conventional test scores 
obtained from the same sample of examinees in the empirical study was .89, 
indicating that the two sets of test scores share about 80% common variance. 
In contrast, the correlation found in the simulation study was only .82, 
indicating about 67% shared variance. This difference between empirical 
and simulated data was not found in the studies of Betz & Weiss (1973, 1974), 
in which the correlations between conventional and two-stage tests ranged 
between .79 and .84 in both the empirical and simulation studies. Further, 
in other empirical studies, correlations averaging .84 were found between 
conventional and pyramidal tests (Larkin & Weiss, 1974), and correlations 
between .79 and .84 were found between two-stage and pyramidal tests 
(Larkin & Weiss, 1975). Thus, the correlation of .89 between flexilevel 
and conventional test scores found in the present empirical study is higher 
than those found in the parallel simulation study or in other studies com- 
paring two or more testing strategies. 
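
The shared-variance percentages quoted above are the squared correlations. A minimal sketch of that conversion follows; the function name is an illustrative assumption.

```python
def shared_variance(r):
    """Proportion of variance two sets of scores have in common (r squared)."""
    return r ** 2

for label, r in (("empirical", 0.89), ("simulation", 0.82)):
    print(f"{label}: r = {r:.2f}, shared variance = {shared_variance(r):.0%}")
```

The empirical value of .89 gives about 79% (reported as "about 80%"), and the simulation value of .82 gives about 67%.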

The flexilevel test had a significantly higher relationship to underlying 
ability (r=.91) than did the conventional test (r=.89). Both 
correlations were high and indicated a primarily linear relationship 
between test scores and ability. Again, there was substantial variability 
in the test-ability correlations yielded from the 100 samples, indicating 
caution in the interpretation of results of small-sample (i.e., N=100) 
simulation studies. 

Both the flexilevel and conventional test information functions 
indicated greatest precision of measurement for "examinees" of near average 
ability level and decreasing precision with divergence of an examinee's 
ability from the mean ability level. These findings are in general agreement 
with those of Lord (1970, 1971d). However, Lord (1971d) also found 
that the conventional test provided slightly better measurement for ability 
levels within ±1 standard deviation of the mean, but that the flexilevel 
test provided better measurement beyond those points, and that its advantage 
increased substantially with increasing divergence from the mean. Thus, in Lord's 
study, the flexilevel test provided more constant precision of measurement 
across the ability continuum. 

In contrast, the results of the present study indicated that the 
flexilevel test provided more information than did the conventional test at 
all ability levels between θ=-3.2 and θ=+1.5. Surprisingly, the superiority 
of the flexilevel test was most apparent for ability levels between θ=-1.0 
and θ=0. These results, in combination with the larger standard deviation 
of flexilevel information values across ability levels, indicate that the 
flexilevel test provided less constant precision of measurement than did 
the conventional test. These results are contrary to those of Lord's (1971d) 
theoretical study of flexilevel testing. 
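
The "constancy of precision" comparison can be checked against the average information values in Table A-3. The sketch below is illustrative only. It assumes the sample (n-1) standard deviation, and it takes the flexilevel average at θ=-3.2, which is illegible in the table, to be the mean of the Time 1 and Time 2 values (1.38 and 1.42).

```python
import statistics

# Average information values (mean of Time 1 and Time 2) at the 16 ability
# levels of Table A-3, listed from theta = 3.2 down to theta = -3.2.
# The final flexilevel entry (1.40) is an assumed value; see the lead-in.
flexilevel = [0.05, 0.22, 1.08, 2.90, 5.60, 5.96, 6.07, 6.73,
              7.05, 5.99, 6.47, 4.50, 3.82, 2.44, 2.36, 1.40]
conventional = [0.85, 0.02, 1.12, 3.29, 4.38, 4.42, 5.30, 4.46,
                4.38, 4.25, 3.53, 3.01, 2.50, 1.32, 0.39, 0.15]

def summarize(values):
    """Return (mean, sample standard deviation) of the information values."""
    return statistics.mean(values), statistics.stdev(values)

flex_mean, flex_sd = summarize(flexilevel)
conv_mean, conv_sd = summarize(conventional)
print(f"flexilevel:   mean = {flex_mean:.2f}, S.D. = {flex_sd:.2f}")
print(f"conventional: mean = {conv_mean:.2f}, S.D. = {conv_sd:.2f}")
```

A higher mean corresponds to greater overall precision, while a larger standard deviation across ability levels corresponds to less constant precision; on both counts the flexilevel values exceed the conventional ones.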

In a simulation study of the two-stage adaptive testing strategy, 
Betz & Weiss (1974) found that, in agreement with Lord's (1971e) study of 
two-stage testing, one two-stage test provided relatively constant precision 
of measurement across the ability continuum; the information function 
approximated a horizontal line. However, a second two-stage test did not 
provide constant precision of measurement but rather yielded an information 
function similar in shape to that of the conventional test although at a 
higher overall level. 

The differences in information values for the conventional and flexilevel 
tests must also be interpreted in light of differences in the average 
discriminating power of the test items, since higher item discriminations 
will generally lead to higher values of information. As was discussed 
in the section on test construction, the flexilevel test items had a higher 
mean discrimination (a=.65) than did the conventional test items (a=.54); 
but examinees of relatively low ability would take a more discriminating 
set of items on this flexilevel test than would examinees of relatively high 
ability. Thus, where the information provided by the flexilevel and 
conventional tests was equivalent (at about θ=2.0), the average item 
discrimination was also equivalent (a=.54). In the center of the ability 
distribution, where the flexilevel test showed the greatest advantage over 
the conventional test, the mean item discrimination for flexilevel was 
about a=.69 (again compared to .54 for conventional). At ability levels 
below θ=-1.5, the flexilevel test still provided more information, but 
somewhat less than would be expected considering that the mean item 
discrimination was about a=.74 or .75. 
If item discrimination were the only factor influencing information 
values, the flexilevel test should have the highest values at the lowest 
ability levels. Thus, while other factors can be assumed to be operating, 
it does seem that some of the difference between flexilevel and conventional 
test information values may be attributable to differences in mean item 
discriminations. Further research using flexilevel tests in which mean 
item discriminations are equivalent for examinees of all ability levels 
and are equal to those of conventional tests will be necessary to separate 
the effects of item discrimination from those of the characteristics of the 
testing strategies in influencing the overall level and shape of test 
information functions. 
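
The dependence of information on item discrimination can be sketched with the standard item information function for a two-parameter normal ogive item without guessing. This is an illustration under that assumed model, not the procedure used to obtain the test information values reported in this study; the function names are hypothetical.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def item_information(theta, a, b):
    """Information of one normal ogive item (no guessing) at ability theta.

    P(theta) = Phi(a * (theta - b)); information = P'(theta)^2 / (P * (1 - P)).
    Intended for moderate values of a * (theta - b), where P is not 0 or 1.
    """
    z = a * (theta - b)
    p = normal_cdf(z)
    slope = a * normal_pdf(z)
    return slope ** 2 / (p * (1.0 - p))

# Compare the mean discriminations cited in the text for an item whose
# difficulty matches the examinee's ability (theta = b = 0).
for a in (0.54, 0.65, 0.69, 0.75):
    print(f"a = {a:.2f}: information at theta = b is {item_information(0.0, a, 0.0):.3f}")
```

At θ=b the information under this model is proportional to the square of a, so a difference between a=.54 and a=.69 alone would produce a sizable difference in information, independent of the testing strategy.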

Finally, the flexilevel test provided higher levels of information at 
lower ability levels than at higher ability levels. While this difference 
may be due to differences in item discrimination, it contradicts previous 
findings by Lord regarding the effects of guessing on measurement effectiveness. 
Lord (1971c) found that guessing had the most adverse effects on the 
measurement effectiveness of both conventional and adaptive tests when 
examinee ability was low. In a conventional test, low ability examinees 
receive items which are, for the most part, too difficult for them; thus, 
their only chance to answer correctly is through guessing. In the flexilevel 
test, however, fewer items should be too difficult for the low ability 
examinee, so guessing should be reduced, leading to less measurement error. 
This hypothesis is supported by the higher information values for the low 
ability testees, and by the data on proportion correct in the flexilevel 
test. Again, further research controlling the factor of discrimination 
will be necessary to determine whether or not flexilevel and other adaptive 
testing strategies yield scores which contain less error due to guessing, 
particularly for low ability examinees. 
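
The guessing argument can be sketched with a three-parameter normal ogive model, in which an examinee either knows the answer or guesses with a fixed chance of success. The guessing parameter c=.20, the discrimination a=.60, and the ability and difficulty values below are hypothetical choices for illustration, not parameters from this study.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_correct(theta, a, b, c):
    """Three-parameter normal ogive: know the answer, or guess with success rate c."""
    p_know = normal_cdf(a * (theta - b))
    return p_know + c * (1.0 - p_know)

def guessing_share(theta, a, b, c):
    """Proportion of correct responses that are due to successful guessing."""
    p_know = normal_cdf(a * (theta - b))
    return c * (1.0 - p_know) / p_correct(theta, a, b, c)

theta_low = -2.0   # hypothetical low ability examinee
a, c = 0.60, 0.20  # hypothetical discrimination and guessing parameters

too_hard = guessing_share(theta_low, a, 0.0, c)   # item peaked at the mean (b = 0)
matched = guessing_share(theta_low, a, -2.0, c)   # item matched to ability (b = theta)
print(f"share of correct answers from guessing, b = 0:  {too_hard:.2f}")
print(f"share of correct answers from guessing, b = -2: {matched:.2f}")
```

Under these assumptions, most of a low ability examinee's correct answers on items that are too difficult come from guessing, whereas on items matched to ability only a small fraction do, which is consistent with the hypothesis stated above.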

The failure of the results of the simulation study to agree in all 
respects with those of Lord's (1971d) theoretical study may also be due to 
the fact that the latter study assumed hypothetical, ideal items, all 
having the same discriminating power and having difficulties corresponding 
to exact desired specifications. The present results, however, were 
obtained using item parameters drawn from a real item pool; the 
limitations of the pool permitted only approximations to the item 
characteristics desired for constructing the flexilevel and conventional 
tests. Further studies using other real or hypothetical but imperfect item 
pools would be useful in clarifying the advantages and disadvantages of 
various testing strategies for use in actual applied assessment situations. 

Summary 

The results of the studies of flexilevel testing showed that a 
flexilevel test had significantly greater parallel-forms reliability and a 
significantly higher relationship to underlying ability than did a 
conventional test. The test-retest stability of the two tests was equivalent 
for the total group of examinees, but there was some evidence, both from 
an analysis of the number of items repeated in the flexilevel test and 
from a comparison of stability and parallel forms reliability coefficients, 
that memory effects may be more influential in the stability of conventional 
test scores than in that of flexilevel test scores. The relationship 
between flexilevel and conventional test scores (r=.89) in the empirical 
study was as high as the test-retest stability of either test; the 
relationship shown in the simulation study (r=.82) was less than the 
parallel-forms reliability of the flexilevel test (r=.84) but greater than 
that of the conventional test (r=.80). The flexilevel test provided a higher 
level of information, i.e., greater precision of measurement, than did the 
conventional test, but it also yielded less constant precision of measure- 
ment for examinees of varying ability levels than did the conventional 
test. However, the interpretation of differences in information values 
for the two tests was confounded by differences in item discriminating 
power. Flexilevel test scores better reflected the underlying normal 
distribution of ability than did conventional test scores, and there was 
evidence that the flexilevel test was adapting item difficulties to 
differences in the ability levels of groups and of individuals. The flexilevel 
test also appeared to reduce guessing. Further research will be necessary 
to clarify the relative utility of flexilevel and conventional testing 
strategies in terms of other psychometric and practical criteria. 

APPENDIX

Table A-1

Item Reference Numbers and Normal Ogive
Item Parameters for the Flexilevel Test

     Item   Item      Diffi-    Discrim-         Item   Item      Diffi-    Discrim-
Reference Serial       culty     ination    Reference Serial       culty     ination
  Number* Number         (b)         (a)      Number* Number         (b)         (a)

      121      1       -3.11         .70          655     41         .08         .39
      131      2       -2.98         .56          386     42         .14         .70
       89      3       -2.82         .67          266     43         .15         .86
      198      4   [missing]         .74          264     44         .21         .86
       82      5       -2.77         .50          340     45         .30         .78
   80 [?]      6       -2.55         .79          296     46         .34         .91
      184      7       -2.54         .67          111     47         .46         .48
       31      8       -2.50         .66          213     48         .65         .29
       66      9       -2.32         .80          164     49         .62         .41
       95     10       -2.20         .50          656     50         .71         .44
      262     11       -2.29         .70          294     51         .79         .70
      214     12       -2.08         .42          321     52         .79         .63
       34     13       -1.93         .74        21[?]     53         .92         .37
       83     14       -1.80         .77          299     54         .98         .52
      186     15       -1.65         .92          120     55        1.07         .72
       88     16       -1.74         .63          147     56        1.15         .38
      199     17       -1.42         .92          217     57        1.25         .43
      103     18       -1.34         .89          668     58        1.26         .39
      173     19       -1.43         .76          652     59        1.33         .60
       47     20       -1.31         .87          152     60        1.40         .55
       43     21       -1.21         .90          400     61        1.62         .34
       87     22       -1.10         .99          359     62        1.54         .58
      109     23       -1.06         .89          319     63        1.49         .62
      204     24       -1.15         .73          253     64        1.65         .39
       85     25       -1.07         .76          383     65        1.82         .36
      123     26       -1.00         .67          273     66        1.79         .49
      349     27        -.94         .74          379     67        1.94         .64
      130     28        -.85         .75          166     68        2.03         .64
      128     29        -.75         .82          672     69        1.89         .85
       37     30        -.69         .66          297     70        2.31         .40
       91     31        -.59         .83          336     71        2.05         .49
      270     32        -.52         .86          309     72        2.47    [illeg.]
      188     33        -.47         .71          245     73        2.32         .38
      145     34        -.41         .59          398     74      [?].34    [illeg.]
    2[?]9     35        -.40         .64          385     75        2.35         .42
       56     36        -.28         .75          298     76        2.62         .43
      329     37        -.21         .86          364     77        3.11         .32
      272     38        -.13         .98          388     78        2.86         .43
      630     39        -.05        1.31          664     79        2.95         .84
      258     40         .00         .41

     Mean               -.01         .65
     S.D.               1.68         .20

*Refers to item numbers used in McBride & Weiss (1974) Appendix A.

Table A-2

Item Reference Numbers and Normal Ogive
Item Parameters for the Conventional Test

     Item
Reference
  Number*      Difficulty (b)      Discrimination (a)

       58                -.96               [missing]
      221                -.74                [illeg.]
      307                -.84                [illeg.]
      393                -.95                [illeg.]
      211                -.72                [illeg.]
      224                -.79                [illeg.]
      390                -.73                [illeg.]
      667                -.73                [illeg.]
      156                -.63                [illeg.]
      208                -.68                [illeg.]
      234                -.69                [illeg.]
       52                -.28                [illeg.]
      137                -.74                [illeg.]
      176                -.90                [illeg.]
      207                -.53                [illeg.]
      218                -.93                [illeg.]
      205                -.62                [illeg.]
      382                -.48                [illeg.]
      391                -.53                [illeg.]
      626              -.2[?]                [illeg.]
      645                -.32                [illeg.]
      661                -.30                [illeg.]
      670                -.28                [illeg.]
      327                -.25                [illeg.]
       50                -.23                [illeg.]
      144                -.18                     .63
      369                -.22                     .56
      233                -.17               [missing]
[missing]                -.15                     .54
      633                -.08                     .50
      146                 .00                     .61
      295                -.04                     .47
      113                 .25                     .61
      267                 .19                     .44
       59                 .17                     .64
      271                 .33                     .53
      302                 .37                     .50
      375                 .46                     .49
      666                 .42                     .55
      651                 .49                     .56

     Mean                -.33                     .54
     S.D.                 .43                     .08

*Refers to item numbers used in McBride & Weiss (1974) Appendix A.

Table A-3

Information values from Time 1 and Time 2
administrations, and averages, for flexilevel
and conventional tests (equal-frequency
distribution)

Level of                Flexilevel                  Conventional
Ability (θ)     Time 1   Time 2  Average     Time 1   Time 2  Average

        3.2        .11      .01      .05       1.06      .65      .85
        3.0        .33      .11      .22        .03      .01      .02
        2.5       1.12     1.03     1.08       1.07     1.18     1.12
        2.0       3.47     2.33     2.90       3.37     3.22     3.29
        1.5       6.56     4.63     5.60       3.86     4.90     4.38
        1.0       6.88     5.04     5.96       4.81     4.03     4.42
         .5       6.53     5.60     6.07       6.07     4.53     5.30
         .1       6.53     6.93     6.73       3.96     4.96     4.46
        -.1       6.48     7.61     7.05       4.31     4.45     4.38
        -.5       6.02     5.97     5.99       4.88     3.62     4.25
       -1.0       7.18     5.77     6.47       3.71     3.34     3.53
       -1.5       4.07     4.92     4.50       3.51     2.51     3.01
       -2.0       3.89     3.75     3.82       2.33     2.66     2.50
       -2.5       2.22     2.66     2.44       1.38     1.25     1.32
       -3.0       2.21     2.52     2.36        .28      .51      .39
       -3.2       1.38     1.42     1.40        .08      .22      .15

       Mean       4.06     3.77     3.92       2.79     2.63     2.71
       S.D.       2.56     2.39     2.44       1.92     1.76     1.81

Note. N=100 per ability level.
Table A-4

Smoothed values of the information function for
flexilevel and conventional tests within
intervals of the continuum of underlying
ability (normal distribution of ability levels)

Interval of
Ability (θ)          N   Flexilevel   Conventional

3.1 to 3.3           4          .18            .19
2.9 to 3.1          14          .41            .27
2.7 to 2.9          30          .74            .41
2.5 to 2.7          52         1.08            .76
2.3 to 2.5         100         1.48           1.47
2.1 to 2.3         168         2.05           2.25
1.9 to 2.1         192         2.69           3.00
1.7 to 1.9         318         3.40           3.50
1.5 to 1.7         452         3.88           3.81
1.3 to 1.5         596         4.31           4.09
1.1 to 1.3         742         4.60           4.28
.9 to 1.1         1042         4.97         4.4[?]
.7 to .9          1088         5.26           4.50
.5 to .7          1334         5.56           4.56
.3 to .5          1496         5.77           4.63
.1 to .3          1442         5.93           4.64
-.1 to .1         1690         5.99           4.60
-.3 to -.1        1548         5.90           4.43
-.5 to -.3        1550       [?].72           4.24
-.7 to -.5        1264       [?].45           3.99
-.9 to -.7        1156       [?].26           3.78
-1.1 to -.9        948       [?].05           3.61
-1.3 to -1.1       652       [?].87           3.47
-1.5 to -1.3       660       [?].55           3.29
-1.7 to -1.5       470       [?].18           3.07
-1.9 to -1.7       350       [?].89           2.76
-2.1 to -1.9       208       [?].74           2.40
-2.3 to -2.1       144       [?].66           2.08
-2.5 to -2.3        82       [?].34           1.75
-2.7 to -2.5        56       [?].92           1.49
-2.9 to -2.7        40       [?].07           1.08
-3.1 to -2.9        16       [?].24            .72
-3.3 to -3.1        12       [?].43            .31

Mean                           3.81           2.85
S.D.                           1.70           1.53

Note. Values obtained using method of "moving averages"
(McNemar, 1969, p. 8).
Table A-5

Information values from Time 1 and Time 2 administrations,
and averages, for flexilevel and conventional tests
(normal distribution of ability) (N=10,000)

Interval of                  Flexilevel (Score 2)                Conventional
Ability (θ)          N     Time 1    Time 2   Average     Time 1    Time 2   Average

3.1 to 3.3           4   [illeg.]  [illeg.]  [illeg.]        .58  [illeg.]       .58
2.9 to 3.1          14   [illeg.]  [illeg.]       .15        .04       .42       .23
2.7 to 2.9          30   [illeg.]  [illeg.]      1.05        .12       .01       .06
2.5 to 2.7          52   [illeg.]  [illeg.]       .90        .43       .41       .42
2.3 to 2.5         100   [illeg.]  [illeg.]      1.28       1.93      1.40      1.66
2.1 to 2.3         168   [illeg.]  [illeg.]  [illeg.]       2.02      1.85      1.93
1.9 to 2.1         192   [illeg.]  [illeg.]  [illeg.]       4.23      2.73      3.48
1.7 to 1.9         318   [illeg.]  [illeg.]  [illeg.]       3.48      3.99      3.73
1.5 to 1.7         452   [illeg.]  [illeg.]      3.95       3.46      3.64      3.55
1.3 to 1.5         596   [illeg.]  [illeg.]      4.30       4.54      4.26  [illeg.]
1.1 to 1.3         742   [illeg.]  [illeg.]      4.47       4.05      4.35      4.20
.9 to 1.1         1042   [illeg.]  [illeg.]      5.21       4.36      5.19      4.77
.7 to .9          1088   [illeg.]  [illeg.]      5.05       4.27      4.45      4.36
.5 to .7          1334   [illeg.]      5.69      5.72       4.54      4.29      4.42
.3 to .5          1496   [illeg.]  [illeg.]      5.84       4.78      4.96      4.87
.1 to .3          1442       5.84      5.96      5.90       4.58      4.53      4.55
-.1 to .1         1690       6.51      5.74      6.13       5.18    4.3[?]      4.78
-.3 to -.1        1548       5.74      6.37      6.01       4.37      4.51      4.44
-.5 to -.3        1550   [illeg.]  [illeg.]      5.84       4.26      4.24      4.25
-.7 to -.5        1264       5.26    5.[?]6      5.26       4.28      3.63      3.96
-.9 to -.7        1156       5.44      5.21      5.33       3.87      3.70      3.78
-1.1 to -.9        948       4.83      4.93      4.88       3.64      3.33      3.48
-1.3 to -1.1       652       4.93      5.46      5.20       3.78      3.33      3.56
-1.5 to -1.3       660       4.52      4.51      4.51       3.85      2.87      3.36
-1.7 to -1.5       470       4.14      4.14      4.14       3.39      2.80      3.10
-1.9 to -1.7       350       3.72      3.92      3.82       2.98      2.50      2.74
-2.1 to -1.9       208       3.06      3.57      3.32       2.97      2.11      2.54
-2.3 to -2.1       144       5.10      3.16      4.13       2.57      1.40      1.98
-2.5 to -2.3        82       4.69      2.61      3.65       1.46      1.50      1.48
-2.7 to -2.5        56       2.50      3.11      2.80       3.10       .94      2.02
-2.9 to -2.7        40     1.8[?]      2.03      1.96       1.26       .33       .80
-3.1 to -2.9        16       1.22      3.80      2.51        .71       .98       .84
-3.3 to -3.1        12      12.52      2.41      7.46        .29       .01       .15

Mean                         4.09      3.74      3.92       2.99      2.79      2.86
S.D.                         2.34      1.85      1.88       1.56      1.63      1.57

*Value was infinite because there was no variance (the two scores
falling in this interval were equal).